Neuroscience

Asymmetric ON-OFF processing of visual motion cancels variability induced by the structure of natural scenes

  1. Juyue Chen
  2. Holly B Mandel
  3. James E Fitzgerald (corresponding author)
  4. Damon A Clark (corresponding author)

Affiliations:
  1. Yale University, United States
  2. Howard Hughes Medical Institute, United States
Research Article
Cite this article as: eLife 2019;8:e47579 doi: 10.7554/eLife.47579

Abstract

Animals detect motion using a variety of visual cues that reflect regularities in the natural world. Experiments in animals across phyla have shown that motion percepts incorporate both pairwise and triplet spatiotemporal correlations that could theoretically benefit motion computation. However, it remains unclear how visual systems assemble these cues to build accurate motion estimates. Here, we used systematic behavioral measurements of fruit fly motion perception to show how flies combine local pairwise and triplet correlations to reduce variability in motion estimates across natural scenes. By generating synthetic images with statistics controlled by maximum entropy distributions, we show that the triplet correlations are useful only when images have light-dark asymmetries that mimic natural ones. This suggests that asymmetric ON-OFF processing is tuned to the particular statistics of natural scenes. Since all animals encounter the world’s light-dark asymmetries, many visual systems are likely to use asymmetric ON-OFF processing to improve motion estimation.

Introduction

For any visual system, motion estimation is an important but computationally challenging task. To accurately extract motion signals from complex natural inputs, visual systems should take advantage of all useful information. One source of information lies in the stable statistics of the visual input, that is, in the regularities of natural scenes (Geisler, 2008). A strong version of this hypothesis is that visual systems are tuned, through evolution and experience, to the statistics of natural environments (Chichilnisky and Kalmar, 2002; Olshausen and Field, 1996; Simoncelli and Olshausen, 2001; Srinivasan et al., 1982). However, it remains unclear how visual systems use the statistics of natural scenes and the motion signals in them to aid in motion estimation (Salisbury and Palmer, 2016; Sinha et al., 2018).

Motion computation has long been understood algorithmically as selective responses to specific spatiotemporal correlations (Fitzgerald et al., 2011; Poggio and Reichardt, 1973; Potters and Bialek, 1994). For example, canonical models propose that animals extract motion from visual signals by detecting pairwise spatiotemporal correlations (Adelson and Bergen, 1985; Hassenstein and Reichardt, 1956). Higher order correlations could also contribute to motion computation, and Bayes optimal visual motion estimators can be written as a sum of terms specialized for detecting different correlation types (Potters and Bialek, 1994; Fitzgerald et al., 2011). This mathematical result follows from a Volterra series expansion, which provides a general and systematic way to represent nonlinear computational systems. Higher order correlations are also empirically relevant. For example, triplet spatiotemporal correlations are the next lowest-order terms after pairwise correlations, and both humans and flies perceive motion in ‘glider’ stimuli that isolate triplet spatiotemporal correlations (Clark et al., 2014; Hu and Victor, 2010). The sensitivity to triplet spatiotemporal correlations shows that motion perception incorporates cues neglected by canonical motion detectors.

Interestingly, perceptual sensitivities to triplet spatiotemporal correlations prove that visual systems must consider the polarity of contrast when computing motion. This is because triplet correlations flip sign when contrast polarities are inverted, which means that the perceptual contribution of triplet correlations to a motion estimator reverses when all input contrasts are inverted. This contrast-polarity dependent motion processing has been hypothesized to be an adaptation to natural scenes, especially to the light-dark asymmetry of the contrast distribution (Clark et al., 2014; Fitzgerald and Clark, 2015; Fitzgerald et al., 2011; Leonhardt et al., 2016; Nitzany and Victor, 2014). For example, simulated motion detectors that were optimized to estimate motion in natural scenes exhibited contrast-polarity-dependent responses similar to flies (Fitzgerald and Clark, 2015; Leonhardt et al., 2016). These modeling studies suggest that contrast-polarity-dependent responses emerge from performance optimization in natural scenes, but they do not show that real visual systems use their sensitivity to triplet spatiotemporal correlations to improve motion estimates. This limitation arises because previous experimental studies measured sensitivities to only a few triplet correlations (Clark et al., 2014; Leonhardt et al., 2016). However, one cannot assess the utility of individual correlations in isolation (Clark et al., 2014; Nitzany et al., 2016), and naturalistic visual signals contain many spatiotemporal correlations with diverse spatiotemporal structures. Moreover, although previous analyses recognized that some kind of light-dark asymmetry is required for triplet correlation sensitivity to emerge in optimized motion estimators (Clark et al., 2014; Fitzgerald and Clark, 2015; Fitzgerald et al., 2011; Leonhardt et al., 2016), they did not discover which statistical regularities within natural scenes were sufficient for the observed motion signals to improve accuracy.

Here, we filled these gaps by systematically measuring the nonlinearities in Drosophila visual motion detection and relating them to light-dark asymmetries in natural scenes. We first systematically characterized low-order components of the fly’s motion computation algorithm by modeling its visually evoked turning behavior with a Volterra series expansion (Clark et al., 2011; Clark et al., 2014; Fitzgerald et al., 2011; Marmarelis and McCann, 1973; Poggio and Reichardt, 1973; Salazar-Gatzimas et al., 2016). Through this framework, we extended canonical pairwise (second-order) motion computation models by adding a triplet (third-order) component that accounts for contrast-polarity-dependent motion computation. We evaluated the performance of the inferred algorithm across an ensemble of moving natural images and discovered that the third-order component improves velocity estimates by canceling image-induced variability in the second-order component. Finally, we leveraged maximum entropy distributions to develop a method for generating synthetic images with precisely controlled contrast statistics. This method revealed that the skewness of natural images allows the fly’s sensitivity to triplet spatiotemporal correlations to improve its canonical motion estimates.

Results

The structure of natural scenes induces variability in second-order motion estimates

To evaluate how canonical motion detectors performed with natural scene inputs, we simulated responses of the Hassenstein-Reichardt Correlator (HRC) to rigidly translating natural scenes. The HRC exemplifies canonical motion detectors, which rely exclusively on pairwise spatiotemporal correlations to estimate motion (Adelson and Bergen, 1985; Hassenstein and Reichardt, 1956) (Figure 1A). It can be equivalently written as a motion energy model (Adelson and Bergen, 1985). We used a database of natural, panoramic photographs to create naturalistic motion stimuli (Meyer et al., 2014). In particular, we first converted the photographs’ luminance signals into local contrast signals (Figure 1B, Figure 1—figure supplement 1). We then rigidly translated these natural images at various horizontal velocities to simulate full-field motion signals (Badwan et al., 2019; Dror et al., 2001; Fitzgerald and Clark, 2015; Leonhardt et al., 2016). This rigid translation of images mimics the motion produced by an animal’s pure rotation, during which visual objects all move at the same rotational velocity and occlusion does not change over time. Real motion through an environment generates more complex signals than this, but rigid translations are straightforward to compute and rotational visual stimuli are known to induce the rotational optomotor response that we focus on in this manuscript.
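The HRC architecture described above can be sketched in a few lines of code. This is a minimal illustration, not the paper's exact model: the first-order exponential filters, their time constants, and the sampling step are all assumptions chosen for simplicity.

```python
import numpy as np

def hrc_response(s_left, s_right, dt=0.01, tau_f=0.01, tau_g=0.05):
    """Hassenstein-Reichardt correlator on two contrast time series.

    s_left, s_right: contrast signals at two neighboring spatial points.
    tau_f, tau_g: illustrative time constants for the fast filter f(t)
    and the slower (delaying) filter g(t); the paper's exact filters differ.
    """
    t = np.arange(0.0, 0.5, dt)
    f = np.exp(-t / tau_f); f /= f.sum()   # fast arm filter f(t)
    g = np.exp(-t / tau_g); g /= g.sum()   # delayed arm filter g(t)
    # Each half-correlator multiplies a delayed signal from one point
    # with a fast-filtered signal from the neighboring point.
    half_right = np.convolve(s_left, g)[:len(s_left)] * np.convolve(s_right, f)[:len(s_right)]
    half_left  = np.convolve(s_right, g)[:len(s_right)] * np.convolve(s_left, f)[:len(s_left)]
    # Full HRC: subtract the mirror-symmetric half to get direction selectivity.
    return half_right - half_left
```

For a pattern drifting rightward past the two points, the right input is a delayed copy of the left, so the rightward half-correlator dominates and the mean output is positive; reversing the inputs flips the sign.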

Figure 1 with 1 supplement.
Second-order motion detectors perform poorly with natural scene inputs.

(A) Schematic of the Hassenstein-Reichardt correlator (HRC). Each half of the HRC receives inputs from two nearby points in space, which are filtered in time with filters f(t) and g(t), and then multiplied together. The full HRC receives outputs from two symmetric halves with opposite direction tuning and subtracts one output from the other. (B) An example two-dimensional photograph from a natural scene dataset (top), including a one-dimensional section (image) through the photograph (bottom), indicated by the red line. So that the image can be viewed clearly, the contrasts in the photograph were mapped onto gray levels so that an equal number of pixels were represented by each gray level. (C) Average response (line) and variance (shaded) of the outputs of an HRC (equivalent to a motion energy model; Adelson and Bergen, 1985) when presented with naturalistic motion at various velocities. Images were sampled from natural scenes (green) or from a synthetic image dataset in which all higher order structure was eliminated (purple, see Materials and methods). (D) Example synthetic image in which all higher order structure was eliminated.

The spatiotemporal contrast signals from these (image, velocity) pairs were used as inputs to the HRC model, and we evaluated the model’s output for fixed image velocities across different scenes (Figure 1C, Materials and methods). The model generated a mean response that was linearly tuned for small velocities, peaked at around 130 °/s, and then decayed to zero for fast speeds (Figure 1C green line). However, we observed substantial variance about the mean response, and this variance implies that different natural scenes generated different second-order motion estimates, even when moving at the same velocity (Figure 1C green shading). This is consistent with the finding that canonical second-order motion detectors generate variable responses with natural scene inputs (Dror et al., 2001; Fitzgerald and Clark, 2015; Sinha et al., 2018).

Next we sought to investigate how the higher order structure of natural scenes influences the performance of the second-order motion estimates. Though canonical motion detectors use only pairwise spatiotemporal correlations, higher order statistics of static images, such as contrast kurtosis, influence the detector’s variance (Clark et al., 2014; Fitzgerald and Clark, 2015). To demonstrate this, we generated a synthetic image set in which we preserved the second-order statistics of natural scenes, including their spatial correlation function and contrast variance, but eliminated all higher order structure (Figure 1D, Materials and methods). When the higher order structure was eliminated, the HRC’s average tuning was unchanged, but there was a marked decrease in the variance (Figure 1C purple). This demonstrates that higher order structure in natural scenes induces variability in canonical motion estimates.
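One standard way to construct such a synthetic ensemble is Fourier phase randomization, which preserves an image's power spectrum (and therefore its spatial correlation function and contrast variance) while destroying all higher order structure. The sketch below illustrates this construction for a one-dimensional image; the paper's exact procedure is described in its Materials and methods and may differ in detail.

```python
import numpy as np

def phase_randomize(image_row, rng=None):
    """Synthetic 1D image with the same power spectrum as the input,
    but with Fourier phases randomized, which eliminates higher-order
    structure (e.g., skewness and kurtosis beyond Gaussian)."""
    rng = np.random.default_rng(rng)
    F = np.fft.rfft(image_row)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=F.shape)
    F_new = np.abs(F) * np.exp(1j * phases)
    F_new[0] = F[0]                      # preserve the mean exactly
    if image_row.size % 2 == 0:
        F_new[-1] = F[-1]                # Nyquist bin must stay real
    return np.fft.irfft(F_new, n=image_row.size)
```

By the Wiener-Khinchin theorem, matching the power spectrum matches the (circular) spatial autocorrelation, so second-order statistics survive while the light-dark asymmetries of natural images do not.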

Modeling fly motion computation with second- and third-order Volterra kernels

To investigate how real visual systems compute motion, we wanted to systematically characterize an animal’s motion computation at the algorithmic level (Marr and Poggio, 1976). Motion computation requires a nonlinear transformation to form a motion estimate from the visual stimulus (Borst and Egelhaaf, 1989; Fitzgerald et al., 2011; Poggio and Reichardt, 1973). We approximated this nonlinear transformation using a Volterra series expansion (Marmarelis and McCann, 1973; Marmarelis, 2004; Schetzen, 1980; Wiener, 1966). Similar to the Taylor series from calculus, the Volterra series is a polynomial description of a nonlinearity, with a first-order kernel that describes linear transformations, a second-order kernel that captures quadratic terms, and higher-order kernels that combine to represent a wide variety of nonlinearities beyond the second-order. However, some nonlinearities require many polynomial terms to describe accurately. For instance, the polynomial description of a compressive, saturating nonlinearity is inefficient, and it can be easier to describe such transformations using alternative nonlinear model architectures, such as linear-nonlinear cascade models (Dayan and Abbott, 2001). We emphasize that the Volterra kernel description is explicitly algorithmic, as it aims to summarize the overall system processing without considering the mechanisms leading to this processing.
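In symbols, the expansion through third order takes a standard discrete-time form. The notation below follows the kernel names that appear in Figure 2B; second- and third-order terms for other input assignments (e.g., products of $s_L$ with itself, or $s_R$-heavy triplets) are implied by the ellipsis:

```latex
r(t) = k_0
     + \sum_{\tau} K^{1}(\tau)\, s(t-\tau)
     + \sum_{\tau_1,\tau_2} K^{2}_{LR}(\tau_1,\tau_2)\, s_L(t-\tau_1)\, s_R(t-\tau_2)
     + \sum_{\tau_1,\tau_2,\tau_3} K^{3}_{LRL}(\tau_1,\tau_2,\tau_3)\, s_L(t-\tau_1)\, s_R(t-\tau_2)\, s_L(t-\tau_3)
     + \dots
```

Each kernel weights a distinct order of stimulus product, which is what allows responses to be rigorously grouped by order.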

Volterra kernels are useful for studying visual motion processing because they allow us to rigorously group response properties by their order (Fitzgerald et al., 2011; Potters and Bialek, 1994), thereby permitting us to clearly describe both canonical and contrast polarity-dependent components of the behavior. For example, the second-order kernel is equivalent to canonical motion-detecting algorithms, as it explains the sensitivity to pairwise spatiotemporal correlations (Fitzgerald et al., 2011; Salazar-Gatzimas et al., 2016). Second-order Volterra kernels, along with related spike-triggered covariance methods (Bialek and van Steveninck, 2005; Sandler and Marmarelis, 2015; Schwartz et al., 2006), have been used to model second-order behavior and neural processing in flies and primates (Clark et al., 2011; Marmarelis and McCann, 1973; Poggio and Reichardt, 1973; Rust et al., 2005; Salazar-Gatzimas et al., 2016). However, the second-order kernel cannot capture the system’s sensitivity to triplet spatiotemporal correlations. We therefore minimally extended the depth of the Volterra series expansion to include the third-order kernel. The third-order kernel directly measures sensitivities to triplet spatiotemporal correlations and probes ON/OFF asymmetries in motion processing.

Experimental measurements of Volterra kernels in fly behavior

We focused on how the fly responds to correlations between nearest-neighbor pixels in the visual input, which corresponded roughly to a single ommatidium separation (Buchner, 1976) (Figure 2A). The second-order kernel describes how the behavioral response is influenced by the product of contrasts at each pair of spatiotemporal points in the visual input (Figure 2B blue). In comparison, the third-order kernel describes how the response is influenced by the product of contrasts at each triplet of spatiotemporal points in the visual input (Figure 2B green). Note that triplet spatiotemporal correlations could in principle be computed across three distinct spatial locations, but our analysis focused on triplet spatiotemporal correlations distributed across two nearest-neighbor pixels.

Figure 2 with 1 supplement.
We modeled the fly’s motion computation algorithm with second- and third-order Volterra kernels and extracted the kernels using reverse-correlation (Appendix 1, Materials and methods).

(A) Visual depiction of our model, where the fly’s motion computation system consists of a spatial array of elementary motion detectors (EMDs), and each EMD receives inputs from two neighboring spatial locations. We presented flies with vertically uniform stimuli with 5°-wide pixels, roughly matching ommatidium spacing. (B) Diagram showing how the output of one EMD at time t, indicated by gray dashed line, is influenced by the second- and third-order products in the stimulus. Left: visual inputs of one EMD, sL and sR. The visual stimulus contained products of pairwise and triplet points with various spatiotemporal structure. One specific pairwise product is highlighted (blue barbell) and one specific triplet product is highlighted (green triangle). Middle: The motion computation of the EMD is approximated by the second-order kernel (blue) and the third-order kernel (green). The second-order kernel (blue) KLR2(τ1,τ2) is a two-dimensional matrix. For example, the response at time t is influenced by the products of sL(t-7) and sR(t-9) with weighting KLR2(7, 9). The third-order kernel (green) KLRL3(τ1,τ2,τ3) is a three-dimensional tensor. The response at time t is influenced by sL(t-1)sR(t-4)sL(t-3) with weighting KLRL3(1, 4, 3). Right: turning response at time t is influenced by all pairwise and triplet products in the visual stimulus, with weightings given by the second- and third-order kernel elements. (C) Diagram of the fly-on-a-ball rig. We tethered a fly above a freely-rotating ball, which acted as a two-dimensional treadmill. We presented stochastic binary stimuli, and measured fly turning responses. (D) The extracted second-order kernel. The color represents the magnitude of the kernel, with red indicating rightward turning and blue indicating leftward turning to positive pairwise spatiotemporal correlations. 
Above the diagonal line, the matrix represents left-tilted pairwise products (example in B), and below the diagonal line it represents right-tilted pairwise products. (E) The extracted third-order kernel. For visualization purposes, we show only the two diagonals with the largest magnitude.

In order to extract Volterra kernels, especially higher-order ones, we needed a large amount of data. We thus developed a high-throughput setup to measure turning in walking flies in response to visual stimuli (Figure 2C) (Creamer et al., 2018; Creamer et al., 2019). In this setup, a fly’s optomotor turning response serves as a readout of its motion perception (Götz and Wenking, 1973; Hassenstein and Reichardt, 1956), which allowed us to characterize the fly’s motion computation algorithm by measuring its visually-evoked turning response. Flies spend a large portion of their lives standing and walking on surfaces, making walking optomotor responses ethologically critical (Carey et al., 2006).

We extracted the second- and third-order Volterra kernels with reverse-correlation methods. To do this, we presented flies with spatiotemporally uncorrelated binary stimuli on a panoramic screen around the fly, measured their turning responses, and correlated the behavior at each time to the stimuli preceding it (Materials and methods, Appendix 1, Figure 2—figure supplement 1) (Clark et al., 2011; Mano and Clark, 2017; Salazar-Gatzimas et al., 2016). The measured second-order kernel showed positive and negative lobes (Figure 2D). The positive lobe below the diagonal indicates that flies turned to the right when presented with positive correlations in the rightward direction. This second-order kernel is consistent with classical models of motion computation and with previous neural and behavioral measurements (Clark et al., 2011; Marmarelis and McCann, 1973; Salazar-Gatzimas et al., 2016). The measured third-order kernel also showed both positive and negative values (Figure 2E), and we will dissect its detailed structure later in this manuscript. However, we first set out to evaluate how the third-order kernel contributed to motion estimation across an ensemble of moving natural images.
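The reverse-correlation step for a second-order kernel can be sketched as follows. This is a schematic estimator assuming zero-mean, spatiotemporally white binary (±1) stimuli with unit variance; the paper's estimator (Appendix 1) includes normalizations omitted here.

```python
import numpy as np

def second_order_kernel(s_left, s_right, response, max_lag):
    """Estimate K2(tau1, tau2) by reverse correlation: correlate the
    response at each time t with the product s_L(t-tau1) * s_R(t-tau2).
    Assumes white, unit-variance stimuli, so the raw correlation is the
    kernel; a full estimator would divide by the stimulus variance."""
    T = len(response)
    K2 = np.zeros((max_lag, max_lag))
    for t1 in range(max_lag):
        for t2 in range(max_lag):
            prod = s_left[max_lag - 1 - t1 : T - t1] * s_right[max_lag - 1 - t2 : T - t2]
            K2[t1, t2] = np.mean(response[max_lag - 1:] * prod)
    return K2
```

With enough data, a response generated by a known pairwise product is recovered at the correct lags, while unrelated kernel elements shrink toward zero as 1/sqrt(T); the third-order kernel is extracted analogously with triplet products, which is why it demands much more data.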

The third-order kernel improves velocity estimation for moving natural scenes

The kernels were fit to turning behavior, so the output of the model to moving visual stimuli is the predicted optomotor turning response. Following previous work (Clark et al., 2014; Fitzgerald and Clark, 2015; Leonhardt et al., 2016; Poggio and Reichardt, 1973; Potters and Bialek, 1994; Sinha et al., 2018), we hypothesized that optomotor turning responses provide a proxy for the fly’s velocity estimate. Using the fitted behavioral model, we could thus investigate how accurately the fly’s velocity estimate tracks the true image velocity. We evaluated the fly’s motion computation performance with a simple and specific metric: when an entire natural image translates rigidly with constant velocity, how accurately does the behavioral algorithm predict the image velocity (Figure 3A)? Specifically, does the fly use its sensitivity to triplet spatiotemporal correlations to improve velocity estimation?

Figure 3 with 2 supplements.
The third-order kernel improved motion estimation in natural scenes.

(A) Predicting responses of the second- and third-order kernels to rigidly moving scenes. Top: natural scenes rigidly translating with constant velocities. Middle: cartoon of the second- and third-order kernels. Bottom: second-order response (blue), third-order response (green), the predicted motion estimate (red) is the summation of r(2) and r(3). (B) Scatter plot of r(2), r(3) and r(2)+r(3) against image velocity over the ensemble of moving images. 10,000 independent trials were simulated, and 1000 trials were plotted here. (C) Pearson correlation coefficients between responses of each kernel and the true image velocities (ρ = 0.44 ± 0.01, -0.14 ± 0.01, 0.54 ± 0.01, 0.56 ± 0.01, from left to right; w = 1.39 ± 0.01; mean ± SEM across 10 groups of 1000 trials). (D) Scatter plot between r(3) and the residual in r(2), computed by subtracting a scaled image velocity from r(2) (Materials and methods). ρ represents the Pearson correlation coefficient mean ± SEM across 10 groups (Materials and methods).

We sampled the velocities from a zero-mean Gaussian distribution with a standard deviation of 114 °/s: this distribution roughly matched turning distributions in walking flies (DeAngelis et al., 2019; Katsov and Clandinin, 2008). Crucially, because we measured the Volterra kernels, we could separate the fly’s predicted output into two components: the canonical second-order response, r(2), and the non-canonical third-order response, r(3) (Figure 3A). The second-order response is the output from the second-order kernel, and it describes how the fly responded to naturalistic second-order spatiotemporal correlations in the stimulus. Similarly, r(3) is the output from the third-order kernel, and it describes how the fly responded to naturalistic triplet spatiotemporal correlations. This separation allowed us to ask how the pairwise and triplet correlations are individually and jointly used to estimate motion.
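Given the measured kernels, the two response components can be computed by direct evaluation of the Volterra sums. The sketch below handles one detector's two inputs and, for brevity, only the LRL input assignment of the third-order kernel, whereas the full model sums over several assignments and a spatial array of detectors; names and structure are illustrative.

```python
import numpy as np

def kernel_responses(K2, K3, sL, sR):
    """Predicted second-order (r2) and third-order (r3) responses from
    a second-order kernel K2[tau1, tau2] and a third-order kernel
    K3[tau1, tau2, tau3] (LRL assignment only), by direct summation."""
    T = len(sL); L = K2.shape[0]
    r2 = np.zeros(T); r3 = np.zeros(T)
    for t in range(L - 1, T):
        winL = sL[t - L + 1 : t + 1][::-1]   # winL[tau] = sL(t - tau)
        winR = sR[t - L + 1 : t + 1][::-1]
        r2[t] = np.einsum('ij,i,j->', K2, winL, winR)
        r3[t] = np.einsum('ijk,i,j,k->', K3, winL, winR, winL)
    return r2, r3
```

Because r2 and r3 are computed from separate kernels, their contributions to the predicted turning can be analyzed individually or summed.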

We quantified how well the model’s responses predicted the image velocity using the Pearson correlation coefficient (Clark et al., 2014; Fitzgerald and Clark, 2015; Leonhardt et al., 2016). This metric supposes that the model response and image velocity are linearly related, and its value intuitively summarizes the mean-squared error of the best linear fit between the model’s output and the image velocity. When the correlation coefficient has an absolute value near 1, the model closely tracks image velocity, while a value near 0 indicates no linear relationship between model and image velocity. The responses derived from the second-order kernel, r(2), correlated positively with the true velocity (Figure 3B blue), indicating that the second-order response matches the behavioral direction (Clark et al., 2011; Hassenstein and Reichardt, 1956; Salazar-Gatzimas et al., 2016). Interestingly, the isolated third-order response, r(3), anti-correlated with true image velocities (Figure 3B green). This means that the fly’s third-order response on its own would predict that the fly turns in the direction opposite to the presented motion. However, when r(3) was added to r(2), the accuracy of the full motion estimator increased by ~25% compared to r(2) alone (Figure 3B red, Figure 3C). This important result shows that the third-order responses improve velocity estimates only in conjunction with second-order responses.
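This counterintuitive pattern, where a component that anticorrelates with velocity nevertheless improves the summed estimate, is easy to reproduce with toy numbers. In the sketch below, the arrays stand in for the model outputs; the coefficients are invented to mimic the qualitative structure of Figure 3 and are not the measured data.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two arrays."""
    a = a - a.mean(); b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Toy stand-ins for the model outputs (invented coefficients):
rng = np.random.default_rng(0)
v = rng.normal(0.0, 114.0, 10000)        # image velocities (deg/s)
scene = rng.normal(0.0, 150.0, v.size)   # scene-induced variability
r2 = 0.6 * v + scene                     # canonical response: velocity plus scene term
r3 = -0.1 * v - 0.8 * scene + rng.normal(0.0, 30.0, v.size)  # mostly cancels the scene term
```

In this toy, r3 weakly anticorrelates with velocity on its own, yet r2 + r3 tracks velocity better than r2 alone, because r3 subtracts the shared scene-induced term.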

To understand this counterintuitive finding, it is useful to recognize that the second-order response is influenced by both the image velocity and the structure of the natural scene. For example, recall that the output of the HRC depended both on the velocity of motion and on the particular image that was moving (Figure 1C). Thus, one way to improve the accuracy of the response is to reduce scene-dependent variability in the second-order estimate. To investigate whether this interpretation explained the observed improvement, we calculated the residuals of the second-order responses by subtracting the best linear fit of the image velocity and plotted them against the third-order responses. We found that the third-order signal was strongly anticorrelated with this scene-induced residual in the second-order response (Figure 3D). This means that the fly’s sensitivity to triplet spatiotemporal correlations indeed canceled scene-dependent variability in the second-order motion estimator to improve the accuracy of motion estimation across natural scenes.
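The residual computation behind Figure 3D is a simple least-squares regression, sketched here; the function name is ours, and the paper's exact procedure is in its Materials and methods.

```python
import numpy as np

def scene_residual(r2, v):
    """Residual of the second-order response after removing the best
    linear (least-squares) fit of image velocity. What remains is the
    scene-induced variability in the second-order estimate."""
    slope, intercept = np.polyfit(v, r2, 1)
    return r2 - (slope * v + intercept)
```

By construction, the residual is uncorrelated with velocity in the sample, so any correlation between it and r3 reflects scene structure rather than motion.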

Since the magnitude of the second-order kernel and third-order kernel were each measured experimentally, our model combined r(2) and r(3) with a 1:1 ratio. Nevertheless, we were interested in whether the fly could have done better with alternate weighting coefficients, so we fit a linear regression model to reweight r(2) and r(3) to best predict image velocity. Strikingly, we found the optimized relative weighting between r(2) and r(3) was near one, and the performance of the best weighted model was only marginally better than the empirical model (Figure 3C gray). Thus, the measured second- and third-order kernels were weighted near optimally for performance in naturalistic motion estimation.
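The reweighting analysis reduces to a two-regressor least-squares fit; a minimal sketch, with the function name ours:

```python
import numpy as np

def best_weights(r2, r3, v):
    """Least-squares weights (w2, w3) minimizing ||w2*r2 + w3*r3 - v||^2.
    The relative weight w3/w2 measures how far the empirical 1:1
    combination of the two kernel outputs is from the optimum."""
    X = np.column_stack([r2, r3])
    w, *_ = np.linalg.lstsq(X, v, rcond=None)
    return w  # array [w2, w3]
```

The empirical model corresponds to w2 = w3, so fitting both weights asks whether any other ratio would predict image velocity substantially better.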

We also wanted to understand how the improvement added by r(3) depended on the parameters of our simulation. To see how it depended on the width of the image velocity distribution, we varied the standard deviation over an order of magnitude. The improvement did not depend strongly on the variance of the velocity (Figure 3—figure supplement 1). We also asked how the contrast computation affected the performance of the measured algorithm. When we previously converted luminance into contrast signals, we computed local contrasts on a length scale of 25° (measured by full-width-at-half-maximum), because that is the approximate spatial scale of surround inhibition measured in flies (Arenz et al., 2017; Freifeld et al., 2013; Srinivasan et al., 1982). When we swept this spatial scale from 10° to 75°, the improvement added by the third-order kernel first increased, peaked at around 30°, and then decreased to negative values after 40° (Figure 3—figure supplement 2A–E). When we computed the contrast over time, instead of space, we observed improvements on timescales less than 100 ms, comparable to measured timescales involved in early visual neurons that compute temporal derivatives (Behnia et al., 2014; Srinivasan et al., 1982; Yang et al., 2016) (Figure 3—figure supplement 2FG). However, the third-order term hurt performance when contrasts were computed on longer timescales. These results show that contrast computations influence the utility of the measured third-order kernel, with maximal utility occurring in a regime that approximately matches the contrast computation of the fly eye.
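The local contrast computation swept above can be sketched for a one-dimensional panoramic luminance profile. The Weber-style normalization below is a common convention assumed for illustration and may differ in detail from the paper's Materials and methods; only the 25° FWHM default follows the text.

```python
import numpy as np

def local_contrast(luminance, pixel_deg=1.0, fwhm_deg=25.0):
    """Weber-like local contrast: c(x) = I(x) / I_local(x) - 1, where
    I_local is the luminance smoothed circularly with a Gaussian of the
    given full-width-at-half-maximum (in degrees)."""
    n = luminance.size
    sigma = fwhm_deg / (2.0 * np.sqrt(2.0 * np.log(2.0))) / pixel_deg  # FWHM -> sigma (pixels)
    d = np.minimum(np.arange(n), n - np.arange(n))   # circular distance from pixel 0
    g = np.exp(-0.5 * (d / sigma) ** 2)
    g /= g.sum()
    # Circular convolution via FFT, appropriate for 360-degree panoramas.
    local_mean = np.real(np.fft.ifft(np.fft.fft(luminance) * np.fft.fft(g)))
    return luminance / local_mean - 1.0
```

Sweeping `fwhm_deg` reproduces the kind of spatial-scale manipulation described above: a uniform scene yields zero contrast everywhere, while a localized bright feature produces positive contrast at that location.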

Visualizing the measured third-order kernel with impulse responses

Since the measured third-order Volterra kernel improved motion estimates, we wanted to characterize it in more detail. To better visualize the third-order kernel, we rearranged its elements in an impulse response format (Figure 4AB, Materials and methods). The impulse response of a system is its output when presented with a small and brief input, called an impulse. This impulse may consist of a change in contrast at a single point, in which case the impulse response captures the linear response of the system. Analogously, if the impulse consists of a contrast triplet over three points in space and time, then the triplet impulse response captures the system’s response to the interactions of those three points, after accounting for those responses already explained by linear or second-order impulse responses.

Figure 4 with 3 supplements.
Characterization and validation of the measured third-order kernel.

(A) Triplet impulse response description. Top: the ball-stick diagram represents the relative spatiotemporal position of three points in a triplet. The red line denotes the temporal distance between the two left points, Δτ31, and the blue line denotes the temporal distance between the more recent point on the left and the sole right point, Δτ21. Bottom: Three specific example occurrences of the triplet elicit three impulse responses. The response at time t is the sum of the impulse responses to all previous occurrences of the triplet. The first triplet (lightest green) involves two black points and one white point, so their product is positive, and it elicits an impulse response with positive sign. The triplet occurs far from current time t, so its influence on the current response is small. The last triplet (darkest green) involves two white points and one black point, so the product is negative and it elicits an impulse response with flipped sign. It is close to current time t, and has a large influence on the current response. (B) Third-order kernel visualized using an impulse-response format (Materials and methods). Top: the ball-stick diagrams as in (A). Bottom: the color map plots the 'impulse response' to the corresponding triplets, and color represents the strength of the kernel. Different panels represent different Δτ31. In each color map, Δτ31 is fixed, the columns represent Δτ21, and the rows represent the time since the most recent point in each triplet. The dashed lines indicate the place where the right point is in the middle of the two left points in time. (C) The summed strength of the third-order kernel along each column in A. Error bars represent SEM calculated across flies (n = 72), and significance was tested against the null kernel distribution (*p<0.05, two-tailed z-test (Materials and methods)). 
(D) Scatter plot of the measured responses to third-order glider stimuli (Materials and methods, Figure 4—figure supplement 1) against responses predicted by the third-order kernel (Materials and methods). The correlation between the predicted and measured responses is 0.76. Black dashed line is unity; gray dashed line is the best linear fit. (E) The measured second- and third-order kernels share temporal structure. Top: the ball-stick diagrams represent the relative spatiotemporal positions of the two points in each pair (blue) and three points in each triplet (green). Bottom: the kernel strength of the second-order kernel (blue) and third-order kernel (green), summed across all elements sharing the same spatiotemporal structures, that is, summed over rows in Figure 4—figure supplement 2CD (Materials and methods, Figure 4—figure supplement 3). (F) The extracted third-order kernels from four optimized motion detectors (Fitzgerald and Clark, 2015) compared to the measured kernel from the fly. The summed kernel strength is summed across all elements that share the same spatiotemporal structures diagrammed above.

Triplet correlation impulse responses are useful because they allow one to rapidly digest how different triplet correlations will affect behavior. For example, in Figure 4A, we colored three occurrences of triplets that have the same spatiotemporal structure and their corresponding triplet impulse responses. We set the origin of the impulse responses to be the most recent point in the triplet, because the system could not respond to the interaction of three points before all three points were presented. Since the spatiotemporal structures of these three triplets are the same, the three impulse responses have the same shape. Note that a negative impulse, consisting of an odd number of dark elements within the triplet, would drive turning in the opposite direction. In Figure 4B, we represented the impulse responses of different triplets with colormaps and used ball-stick cartoons to show the relative temporal distances between the points in each triplet. The predicted time course of the behavioral effect is easy to discern, and the kernel predicts that the behavioral consequences of triplet correlations will last almost a second. We can more compactly understand the relative magnitudes of the behavioral effects by summing the impulse responses over time (Figure 4C) (Salazar-Gatzimas et al., 2016). As expected, the impact of different triplet correlations varies significantly in both direction and magnitude.

Verification of the third-order kernel measurement

We verified the reliability of our third-order kernel measurement in two ways. First, we tested the statistical significance of the measured kernel directly. We extracted an ensemble of null kernels by applying the reverse-correlation analysis to the measured behavioral responses and temporally-shifted visual stimuli (Materials and methods). By comparing summed kernel elements in the empirical and null kernels, we found that many terms in the third-order kernel were statistically significant at the p=0.05 level (Figure 4C). Significance was especially common when the temporal distance between the points in the triplet spatiotemporal correlation was less than 0.1 s.
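The null-kernel logic can be sketched numerically. The toy below substitutes a single bar and a synthetic pairwise-driven response for the paper's behavioral data: one second-order kernel element is estimated by reverse correlation, and a null ensemble is built from temporally shifted stimuli, which destroys any true stimulus-response relationship while preserving the stimulus statistics:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 20000
stim = rng.choice([-1.0, 1.0], size=T)          # one bar's binary contrast
# Synthetic "behavior" driven by the pairwise product at lags 1 and 2 frames,
# plus response noise.
resp = np.zeros(T)
resp[2:] = 0.3 * stim[1:-1] * stim[:-2] + rng.normal(0.0, 1.0, T - 2)

def k2_element(stim, resp, lag1, lag2):
    """Reverse-correlation estimate of one second-order kernel element."""
    n, lo = len(resp), max(lag1, lag2)
    return np.mean(resp[lo:] * stim[lo - lag1:n - lag1] * stim[lo - lag2:n - lag2])

empirical = k2_element(stim, resp, 1, 2)

# Null ensemble: repeat the analysis with temporally shifted stimuli.
shifts = rng.integers(100, T - 100, size=200)
null = np.array([k2_element(np.roll(stim, s), resp, 1, 2) for s in shifts])
p_value = np.mean(np.abs(null) >= abs(empirical))
```

With enough data, the empirical element recovers the planted coupling (0.3) while the shifted-stimulus null distribution concentrates near zero, yielding a small p-value.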

Second, we measured the fly’s sensitivity to triplet spatiotemporal correlations with third-order glider stimuli (Figure 4D, Figure 4—figure supplement 1). Third-order glider stimuli are binary stimuli that lack pairwise correlations and are enriched in specific triplet spatiotemporal correlations (Clark et al., 2014; Hu and Victor, 2010). We used the measured third-order kernel to predict responses to the glider stimuli. Most of the measured responses were quantitatively predicted by the third-order kernel (Figure 4D). Several gliders elicited smaller behavioral responses compared to the kernel prediction; such differences might be attributable to induced long-range spatial correlations in glider stimuli (Clark et al., 2014; Hu and Victor, 2010), which are not captured by our measured nearest-neighbor kernel. Nevertheless, the successes revealed by this independent experimental test strongly suggest that we had enough statistical power to reliably fit the third-order kernel to the behavioral data.
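The prediction step amounts to contracting the measured kernel with the stimulus's empirical triple products. The sketch below uses a hypothetical random kernel over two neighboring bars and three time lags (not the measured kernel); it also reproduces the caveat above, since the constructed glider recruits kernel elements beyond its defining correlation through induced longer-range correlations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical third-order kernel over two neighboring bars and three lags:
# k3[t1, t2, t3] weights left(t-t1) * left(t-t2) * right(t-t3).
k3 = rng.normal(0.0, 1.0, (3, 3, 3))

def predicted_mean_response(k3, left, right):
    """Contract the kernel with the stimulus's empirical triple products."""
    T, out = len(left), 0.0
    for t1 in range(3):
        for t2 in range(3):
            for t3 in range(3):
                lo = max(t1, t2, t3)
                out += k3[t1, t2, t3] * np.mean(
                    left[lo - t1:T - t1] * left[lo - t2:T - t2]
                    * right[lo - t3:T - t3])
    return out

# Uncorrelated binary noise carries no triple correlations: ~zero prediction.
left = rng.choice([-1.0, 1.0], 60000)
right = rng.choice([-1.0, 1.0], 60000)
baseline = predicted_mean_response(k3, left, right)

# A diverging-type glider enforcing left(t) * left(t-1) * right(t) = +1:
gl_r = rng.choice([-1.0, 1.0], 60000)
gl_l = np.ones(60000)
for t in range(1, 60000):
    gl_l[t] = gl_l[t - 1] * gl_r[t]
glider = predicted_mean_response(k3, gl_l, gl_r)

# The enforced correlation activates k3[0,1,0] and k3[1,0,0]; the glider's
# induced longer-range correlations additionally recruit k3[1,2,1], k3[2,1,1].
expected = k3[0, 1, 0] + k3[1, 0, 0] + k3[1, 2, 1] + k3[2, 1, 1]
```

In the paper, the analogous induced correlations extend beyond nearest neighbors in space, which is why they are not captured by the measured nearest-neighbor kernel.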

The second- and third-order kernels share temporal structure

Multiple models propose that sensitivity to pairwise and triplet spatiotemporal correlations could emerge simultaneously from the same nonlinear step in the fly brain (Fitzgerald and Clark, 2015; Leong et al., 2016; Leonhardt et al., 2016). We were thus curious whether the measured second- and third-order kernels had a common temporal structure. To compare the second-order and third-order kernels, we simplified the third-order kernel to a two-dimensional approximation (Figure 4—figure supplement 2ABC), rearranged the second-order kernel into the impulse response format (Figure 4—figure supplement 2D), and computed summed kernel strengths to obtain one-dimensional representations for both kernels (Figure 4E). We compared the second- and third-order kernel elements at the same temporal offsets (Figure 4E top). In the case of pairwise correlations, the temporal offset was determined by the temporal distance between the left and the right points, and in the case of triplet correlations, the temporal offset was determined by the average temporal distance between the left and right points. The summed kernel strengths showed that the second-order and third-order kernels had similar sensitivities to temporal delays between the input pixels, with peak sensitivity at the shortest delays in our experiment (Figure 4E bottom). An analysis employing the singular value decomposition yielded similar results, and also showed comparable kinetics in the behavioral responses to pairwise and triplet correlations (Figure 4—figure supplement 2EFG). These similarities suggest that the second- and third-order responses originate in common physiological processes.
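The singular-value-decomposition comparison can be sketched with toy kernels. Here both kernels are constructed as separable (delay × time) maps sharing one temporal filter, an assumption made purely for illustration; the leading singular vectors then recover the shared temporal component regardless of the kernels' overall amplitudes:

```python
import numpy as np

# Toy kernels built to share a temporal filter: a 2D second-order kernel and
# a 2D approximation of the third-order kernel, both (delay x time) maps.
t = np.arange(30) / 60.0
f = t * np.exp(-t / 0.05)              # shared fast temporal filter
g = np.exp(-t / 0.1)                   # shared slow envelope
k2 = np.outer(g, f)                    # second-order kernel (toy)
k3_2d = 0.2 * np.outer(g, f)           # third-order kernel, smaller amplitude

# SVD isolates each kernel's dominant temporal component; shared structure
# shows up as matching leading singular vectors (up to sign and scale).
u2, s2, _ = np.linalg.svd(k2)
u3, s3, _ = np.linalg.svd(k3_2d)
match = abs(np.corrcoef(u2[:, 0], u3[:, 0])[0, 1])
```

For the measured kernels, which are not exactly rank one, the same comparison applies to the leading components, alongside the summed-kernel-strength comparison of Figure 4E.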

Comparing the measured third-order kernel to optimal motion estimators

A recent theoretical study proposed several motion detectors whose parameters were optimized for velocity estimation in natural scenes (Fitzgerald and Clark, 2015) (Figure 4—figure supplement 3). In order to compare our measured third-order kernel to those of these optimized motion detectors, we presented stochastic binary stimuli to these detectors and extracted their third-order kernels using reverse-correlation. We found that the third-order kernels of the optimized models were usually similar to each other (Figure 4F), which is consistent with prior analyses (Fitzgerald and Clark, 2015). The measured third-order kernel consistently agreed with the optimized kernels in its signs, and in some cases, the kernels were also similar in magnitude. However, certain kernel elements differed markedly between the optimized models and the behaviorally measured kernel. Perhaps most noticeably, the behavioral kernel was much smaller than the optimized kernels for correlations whose spatiotemporal structure involved large delays between the points (Figure 4F, third and fifth kernel elements). Such differences between the optimized models and the measured behavior could indicate suboptimalities in the fly brain. However, they could also result from unrealistic constraints imposed on the model optimization, such as fixed temporal processing and restricted model structures (Fitzgerald and Clark, 2015; Leonhardt et al., 2016). The measured kernel therefore provides valuable new data to inform theoretical work assessing the optimality of biological motion estimators.

Positive skewness is sufficient for the third-order kernel to improve motion estimates

Which features of natural images allow the measured third-order kernel to improve motion estimates? The natural scene dataset comprises heterogeneous individual images (Figure 5A), so we calculated the contrast mean, variance, skewness, and kurtosis of each image individually. The variance describes the scale of the contrast variation; the skewness quantifies imbalance between contrasts above and below the mean; and the kurtosis roughly characterizes the frequency of extreme bright and dark points. Each of these statistics showed a wide distribution over the image ensemble (Figure 5—figure supplement 1ABCD). These statistics were also highly dependent on each other: a positively skewed image often had high variance and was highly kurtotic (Figure 5B, Figure 5—figure supplement 1E). These strong relationships make it difficult to isolate the effects of individual statistics within the image ensemble.
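These per-image statistics are standardized moments of the pixel contrast distribution. The sketch below computes them for a synthetic stand-in image whose luminances are log-normal, a distribution often used to mimic the bright outliers of natural scenes (the specific luminance model is an assumption, not the paper's dataset):

```python
import numpy as np

def contrast_stats(image):
    """Mean, variance, skewness, and kurtosis of an image's pixel contrasts."""
    x = image.ravel().astype(float)
    mu, sd = x.mean(), x.std()
    skew = np.mean(((x - mu) / sd) ** 3)   # light-dark asymmetry
    kurt = np.mean(((x - mu) / sd) ** 4)   # tail heaviness (3 for a Gaussian)
    return mu, sd ** 2, skew, kurt

# Log-normal luminances converted to Weber-style contrast give a positively
# skewed, leptokurtic image, qualitatively like many natural scenes.
rng = np.random.default_rng(3)
lum = rng.lognormal(mean=0.0, sigma=0.5, size=(64, 64))
contrast = lum / lum.mean() - 1.0
mu, var, skew, kurt = contrast_stats(contrast)
```

Because contrast is defined relative to the image's own mean luminance, the contrast mean is zero by construction, and the asymmetry of the luminance distribution survives as positive contrast skewness.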

Figure 5 with 3 supplements see all
Positive skewness is sufficient for the third-order kernel to improve motion estimation.

(A) Three example natural scenes with different degrees of variance and skewness. (B) Joint-density maps of individual image statistics over the ensemble of natural images, showing the relationship between skewness and variance (left) and skewness and kurtosis (right). (C) Contrast distributions of example images from the natural scene dataset and two synthetic image datasets. The natural image is shown (black), along with a maximum entropy distribution (MED) with matched mean and variance, denoted by +mean +var (green), and an MED with matched mean, variance, and skewness, denoted by +mean +var +skew (brown). (D) Example images from the natural scene dataset and two synthetic image datasets corresponding to the three contrast distributions in (C). (E) The Pearson correlation coefficient between true image velocities and each kernel’s responses in the two synthetic datasets +mean +var (green) and +mean +var +skew (brown). (F) Example of MEDs in six synthetic datasets, in which the image skewness ranged from −1.25 to 1.25. (G) Example of synthetic images in three synthetic datasets, corresponding to MEDs in (F) with constrained skewness of 0.25 (top), 0.75 (middle), and 1.25 (bottom). (H) Improvement added by the third-order response as a function of synthetic image skewness.

To isolate the effects of individual statistics, we therefore generated several different synthetic image datasets that have alternate contrast statistics (Materials and methods). To generate each image, we constructed a synthetic contrast distribution and sampled pixel contrasts from this distribution (Appendix 2, Figure 5—figure supplement 2A). In this way, we could manipulate the statistics of the image by constraining various statistics of the distribution. In particular, we constrained the distribution to have specific lower order moments, such as mean, variance, and skewness. A distribution is not solely determined by its lower order moments, so there can be many distributions sharing the same lower order moments (Figure 5—figure supplement 2A). Among all such distributions, we chose the most random one, known as the maximum entropy distribution (MED) (Berger et al., 1996; Jaynes, 1957; Schneidman et al., 2006; Victor and Conte, 2012). Because we can specify these lower order moments independently, we can ask whether specific statistics are sufficient to generate the improvement added by the third-order signal.
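A minimal numerical sketch of such a moment-constrained MED follows. On a discrete contrast grid, the maximum entropy distribution with fixed first three moments takes the exponential-family form p(x) ∝ exp(λ₁x + λ₂x² + λ₃x³), and the multipliers can be found by solving the convex dual problem; the grid range, target moments, and damped-Newton solver here are illustrative choices, not the paper's implementation:

```python
import numpy as np

def max_entropy_dist(grid, targets, n_iter=400):
    """Maximum entropy distribution on a discrete contrast grid with
    constrained first three moments: p(x) ~ exp(l1*x + l2*x^2 + l3*x^3).
    The convex dual is solved with damped Newton steps."""
    phi = np.stack([grid, grid ** 2, grid ** 3], axis=1)   # moment features
    lam = np.zeros(3)
    for _ in range(n_iter):
        logw = phi @ lam
        w = np.exp(logw - logw.max())                      # avoid overflow
        p = w / w.sum()
        m = p @ phi                                        # current moments
        cov = phi.T @ (p[:, None] * phi) - np.outer(m, m)  # dual Hessian
        lam -= 0.5 * np.linalg.solve(cov + 1e-9 * np.eye(3), m - targets)
    w = np.exp(phi @ lam - (phi @ lam).max())
    return w / w.sum()

# Zero mean, unit variance, skewness 0.75 (so E[x^3] = 0.75 when mean = 0).
grid = np.linspace(-3.0, 5.0, 400)
targets = np.array([0.0, 1.0, 0.75])
p = max_entropy_dist(grid, targets)
moments = np.array([p @ grid, p @ grid ** 2, p @ grid ** 3])
```

Because the lower-order moments enter as independent constraints, each can be matched or varied on its own, which is what makes the sufficiency tests in this section possible.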

We began by generating two synthetic image datasets (Materials and methods). In the first dataset, we generated a synthetic image for each natural image that had the same contrast mean and variance. To do this, we first found an MED whose contrast mean and variance matched those of the natural image (Figure 5C, green). We then generated a single synthetic image by sampling from this MED (Figure 5D green). In the second dataset, we required the synthetic image to have the same contrast mean, variance, and skewness as the original image (Figure 5CD, brown). By retaining the skewness, these synthetic images retained naturalistic light-dark asymmetries. We then asked how the third-order response affected velocity estimates across these two synthetic image datasets. When only the mean and variance of natural scenes were retained, the third-order response was near zero (Figure 5—figure supplement 2A, green) and did not improve motion estimation (Figure 5E). However, when the synthetic scenes were constrained to be naturalistically light-dark asymmetric, the improvement added by the third-order kernel was recovered (Figure 5E), with magnitude comparable to what was observed for the natural scene dataset (Figure 3C).
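Generating a synthetic image then amounts to sampling pixel contrasts i.i.d. from a distribution with the prescribed moments. The sketch below uses a shifted gamma distribution as a convenient closed-form stand-in for the paper's MEDs, since its shape and scale parameters directly set a target variance and positive skewness:

```python
import numpy as np

rng = np.random.default_rng(4)

def skewed_image(var, skew, shape=(64, 64)):
    """Synthetic image with zero-mean pixels of specified variance and
    positive skewness, drawn i.i.d. from a shifted gamma distribution
    (a stand-in for the paper's maximum entropy distributions)."""
    k = 4.0 / skew ** 2                   # gamma shape sets skewness 2/sqrt(k)
    theta = skew * np.sqrt(var) / 2.0     # scale sets the variance k*theta^2
    return rng.gamma(k, theta, size=shape) - k * theta   # shift to zero mean

img = skewed_image(var=1.0, skew=0.75)
x = img.ravel()
sample_skew = np.mean(((x - x.mean()) / x.std()) ** 3)
```

Sampling pixels independently discards the spatial correlations of the original scene while preserving its pointwise contrast statistics, which is exactly the manipulation these synthetic datasets rely on.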

Finally, we wanted to see whether the degree of skewness controlled the magnitude of the improvement. We therefore generated synthetic image datasets in which we systematically varied the image skewness (Figure 5FG). In these synthetic images, the degree of skewness determined how much the third-order response could improve the full motion estimate (Figure 5H). When the images were negatively skewed, the third-order response correlated with image velocity (Figure 5—figure supplement 3C). However, since it also positively correlated with the residual in the second-order response, adding it to the second-order response decreased the model’s overall performance (Figure 5—figure supplement 3CD). When the images were positively skewed, the third-order response became anticorrelated with the residual in the second-order response and thus improved the overall motion estimates (Figure 5H, Figure 5—figure supplement 3CD). These synthetic image sets show that positive image skewness is sufficient for the third-order signal to improve motion estimates. Thus, the measured algorithm for motion estimation leverages light-dark asymmetries found in natural scenes to improve motion estimates.

Discussion

In this study, we first fit a Volterra series expansion to model the fly’s turning behavior in response to binary stochastic stimuli, and both second- and third-order terms in the Volterra series contributed to the turning behavior. We then evaluated the model’s output when it was presented with an ensemble of rigidly translating natural scenes. There, the second- and third-order terms of the model combined to produce outputs that better correlated with image velocities. There is no a priori reason to assume that a model fit to explain turning behavior would necessarily predict the image velocity. Therefore, these results can be taken together to motivate the hypothesis that the magnitude of the fly’s turning response is determined by an internal estimate of velocity. Furthermore, this estimate is specifically tailored for natural environments, since we found that the third-order kernel relies on light-dark asymmetries that are present in natural scenes but not in arbitrary images. Since skewed scenes are prevalent across natural environments, and many visual systems exhibit ON-OFF asymmetric visual processing (Chichilnisky and Kalmar, 2002; Jin et al., 2011; Mazade et al., 2019; Pandarinath et al., 2010; Ratliff et al., 2010; Zaghloul et al., 2003), many animals are likely to use similar strategies for motion perception.

Direct demonstration that triplet correlations improve velocity estimation for natural images

The idea that features of the visual motion computation serve to improve performance in natural environments is conceptually appealing and theoretically powerful. For example, prior studies have found that optimized motion detectors had triplet correlation sensitivities similar to those measured from the fly visual system (Fitzgerald and Clark, 2015; Leonhardt et al., 2016). Although it is intriguing that biologically relevant response properties emerged in optimized motion detectors, the link from contrast-polarity-dependent motion computation to naturalistic motion estimation remained indirect. Here we have provided the first direct demonstration that third-order components of the fly’s motion computation algorithm improve velocity estimation for moving natural scenes. This direct demonstration had not been possible before because prior measurements were limited to a narrow range of correlations, and it was unclear how the measured cues interacted with unmeasured components of the motion estimation algorithm. For example, here we found that the third-order kernel appeared counterproductive when viewed in isolation, but the way flies incorporated triplet correlations was easily interpretable via the deficits of the second-order motion estimator. In principle, this problem could have persisted if motion computation needed to be understood in the context of additional visual motion cues that involve longer spatial scales or higher order nonlinearities, which have been neglected in this study. This suggests that it might be sufficient to comprehensively characterize motion computation with a few local and low-order visual cues, which is encouraging for the approach outlined here. On the other hand, a mechanistically accurate model of visual motion processing could eventually summarize the relevant cues in a more succinct and less abstract way.

Flies use triplet correlations to cancel scene-dependent variability in second-order cues

We found that the third-order responses in flies were anti-correlated with the natural image velocities (Figure 3BC). Nevertheless, they improved velocity estimation when added to the second-order responses (Figure 3D). This result appears counter-intuitive at first but can be understood. Spatiotemporal correlations are influenced by both the motion and local structure of the scene, but motion-driven behaviors should ignore fluctuations stemming from the scene’s structure as much as possible. Pairwise and triplet spatiotemporal correlations are related to contrast variance and skewness, respectively, which are correlated across the ensemble of natural scenes (Figure 5B). This means that fluctuations in second-order signals tend to be accompanied by fluctuations in third-order signals. Therefore, with the right weighting, second-order and third-order signals may collaborate to reduce the image-induced signal fluctuations in the motion estimate (Clark et al., 2014). Indeed, we found that the third-order responses improved motion estimates because they helped to cancel variability in the second-order responses induced by the structure of natural scenes. This finding highlights a generally important but underappreciated point about cue combination and population coding in neural systems. Although a neuron’s tuning is often used as a proxy for its involvement in stimulus processing, even untuned neurons can contribute productively to downstream decoding if their responses are correlated with noise in the tuned neuronal population (Zylberberg, 2017).
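The cancellation logic can be made concrete with a toy simulation. The weights below are invented for illustration: the "third-order" signal is weakly anticorrelated with velocity on its own (as in Figure 3B-C), yet adding it improves the combined estimate because it cancels the scene-induced noise shared with the "second-order" signal:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
v = rng.normal(0.0, 1.0, n)              # true image velocities
scene = rng.normal(0.0, 1.0, n)          # per-image structure fluctuation

# Second-order response tracks velocity but inherits scene-induced noise.
r2 = v + 0.8 * scene
# Third-order response: slightly anticorrelated with velocity by itself,
# but strongly anticorrelated with the scene-induced residual of r2.
r3 = -0.1 * v - 0.8 * scene

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
c3 = corr(r3, v)                         # negative: "untuned" in isolation
c2 = corr(r2, v)
c23 = corr(r2 + r3, v)                   # combining cancels the shared noise
```

This is the cue-combination point in miniature: a signal with poor tuning to the variable of interest can still improve a downstream estimate if its fluctuations are correlated with the noise in the tuned signal.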

Large computational benefits can underlie small behavioral effects

The HRC has explained a large number of behavioral phenomena and neural responses, and it is reasonable to ask how much we have gained by extending its second-order algorithmic description to a third-order one. The magnitude of behavior elicited by the third-order kernel is small compared to the second-order kernel’s contribution. However, the magnitude of the behavioral effect and the magnitude of the underlying performance gain can differ significantly. For example, the largest reported turning response elicited by a third-order glider stimulus is less than 20% of that elicited by second-order glider stimuli (Clark et al., 2014), yet the performance gain afforded by a motion detector designed to detect its defining third-order correlation exceeded 30% (Fitzgerald and Clark, 2015). Similarly, here we predict that many natural images would elicit little output from the third-order kernel (Figure 3B), yet third-order responses improved the correlation between the model output and velocity by ~25% (Figure 3C). These performance gains are modest, but they are comparable to the inarguable benefits provided by spatial averaging and could be ecologically relevant to the fly (Dror et al., 2001; Salazar-Gatzimas et al., 2018). Since we have only approximated the system with a spatially localized low order polynomial, we also expect that the improvements we observed here represent a lower bound on the total effects provided by the full mechanism underlying light-dark asymmetric motion processing. Indeed, longer range and higher order motion detection models can nearly double motion estimation accuracy while predicting realistic glider response magnitudes (Fitzgerald and Clark, 2015). It will be interesting to investigate whether mechanistically accurate models that explain the origin of the third-order kernel also reveal larger performance improvements.

Asymmetric ON-OFF processing could affect motion processing across the animal kingdom

Here, we showed that flies systematically exploit contrast asymmetries in natural scenes to improve their visual motion estimates. This resonates with previous work showing that many visual systems process ON and OFF signals asymmetrically to improve other aspects of visual processing (Kremkow et al., 2014; Mazade et al., 2019; Pandarinath et al., 2010; Ratliff et al., 2010). Moreover, flies and vertebrates share striking anatomical and functional properties in their motion detection circuits (Borst and Helmstaedter, 2015; Clark and Demb, 2016; Sanes and Zipursky, 2010), likely because they are solving similar problems with similar constraints. We thus expect that many visual systems use ON-OFF asymmetric processing to improve visual motion perception. Nevertheless, it remains unclear how similar or different the details of such strategies will be across the animal kingdom. Indeed, although both primates and insects respond to third-order glider stimuli, their patterns of response differ (Clark et al., 2014; Hu and Victor, 2010; Nitzany et al., 2017). ON-OFF asymmetric visual processing also varies in other ways, and there is evidence that contrast adaptation in ON and OFF pathways is different between primate and salamander retinas (Chander and Chichilnisky, 2001).

Adaptations to differences in both habitat and early sensory processing could potentially explain these divergences. Here, we found positive and negative contrast skewness in different terrestrial scenes (Figure 5—figure supplement 1C), but positive skewness was most prevalent across the scenes. If certain habitats feature scenes that are predominantly negatively skewed, our work predicts that animals living in these habitats should have opposite third-order responses to flies (Figure 5H). More generally, natural scenes in different habitats are known to be similar in some statistics and different in others (Balboa and Grzywacz, 2003; Burkhardt et al., 2006). Interestingly, positive skewness might be particularly common in natural luminance distributions, because luminance signals are the product of many independent factors that generically combine to produce log-normal distributions (Richards, 1982). Nevertheless, the skewness level that matters for motion detecting circuits also depends on earlier processing operations in the eye (Figure 3—figure supplement 2C). In our numerical experiments, we computed local contrast signals. The spatial scales of this preprocessing influenced the statistics of the resulting contrast, which in turn influenced the performance of the motion computation. Biologically, this suggests that signal processing in early visual circuits can strongly influence how downstream circuits organize their computations (Dror et al., 2001; Fitzgerald and Clark, 2015). Alternatively, early sensory processing might be tailored to accommodate the computational requirements of downstream processing. These possibilities are not mutually exclusive, and in both cases, the early visual processing must work in concert with the downstream motion detectors to form robust and consistent perceptions.

Volterra kernels systematically characterize nonlinear motion computations

The algorithm used by the visual system to extract motion signals is a nonlinear transformation from light detection to motion estimates. There are numerous ways to characterize a nonlinear system. In many cases, visual neuroscientists have purposefully designed stimuli, such as sinusoidal gratings, plaids, and gliders, to probe specific nonlinearities in the system (Clark et al., 2014; Creamer et al., 2018; Euler et al., 2002; Fisher et al., 2015; Haag et al., 2016; Hu and Victor, 2010; Movshon et al., 1985; Rust et al., 2005; Salazar-Gatzimas et al., 2018; Salazar-Gatzimas et al., 2016). In other cases, they have used stochastic stimuli to fit simple predictive models, such as linear-nonlinear models, generalized linear models, cascade models, and normalization models to capture a restricted but relatively broad set of biologically plausible nonlinearities (Dayan and Abbott, 2001; Leong et al., 2016; Maheswaranathan et al., 2018; McIntosh et al., 2016; Salazar-Gatzimas et al., 2016; Simoncelli and Heeger, 1998).

Here, we approximated the nonlinear system with Volterra kernels (Marmarelis and Naka, 1972; Wiener, 1966). This represents a general and systematic approach to nonlinear system identification, since (1) one need not make strong assumptions about the system to measure its kernels, (2) higher order kernels can in principle be added to characterize the system arbitrarily well, and (3) a complete set of kernels predicts the system’s output for arbitrary input signals. One major limitation of this approach is that higher order kernels become progressively more difficult to fit as the number of kernel elements increases. This makes the approach most practical when a few low-order terms already capture conceptually important variables. Here, we leveraged the fact that the second-order kernel captures the canonical models for visual motion estimation, while the third-order kernel probes ON/OFF asymmetries in motion processing. These two kernels can be related to distinct statistics of natural scenes.
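The key practical property is that, for independent binary ±1 inputs, each kernel element can be read out by reverse correlation: averaging the response against the matching product of input values. The sketch below plants one second-order and one third-order element in a toy system with added response noise and recovers both (the system and its coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
T = 200000
s = rng.choice([-1.0, 1.0], size=(T, 2))   # binary contrast at two bars

# Ground-truth toy system with one second- and one third-order element.
K2, K3 = 0.5, 0.2
r = np.zeros(T)
r[1:] = (K2 * s[:-1, 0] * s[1:, 1]
         + K3 * s[:-1, 0] * s[1:, 0] * s[1:, 1]
         + rng.normal(0.0, 1.0, T - 1))    # behavioral noise

# Reverse correlation: for independent +/-1 inputs, distinct input products
# are orthogonal, so each kernel element is estimated independently.
k2_hat = np.mean(r[1:] * s[:-1, 0] * s[1:, 1])
k3_hat = np.mean(r[1:] * s[:-1, 0] * s[1:, 0] * s[1:, 1])
```

The same orthogonality is what lets second- and third-order kernels be measured simultaneously from one stochastic stimulus, though the number of elements, and hence the data demand, grows rapidly with kernel order.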

Polynomial approximations to complex nonlinear systems have also been useful in other domains of neuroscience. For example, the experimental phenomenon of frequency-dependent long-term potentiation can be explained by extending canonical pairwise spike-timing-dependent plasticity models to include the relative timing of three spikes (Pfister and Gerstner, 2006; Sjöström et al., 2001). This makes learning sensitive to third-order correlations (Gjorgjieva et al., 2011). In the field of texture perception, researchers have long sought low order statistics that explain whether two patterns are texturally discriminable (Julesz, 1962; Julesz et al., 1973; Julesz et al., 1978). Similar to our findings for motion perception, both natural scene statistics and upstream visual processing play important roles (Hermundstad et al., 2014; Portilla and Simoncelli, 2000; Tkacik et al., 2010). As a final example, understanding how neural network structure impacts dynamics was aided by formally expanding the network’s connectivity matrix into low-order connectivity motifs (Hu et al., 2014; Trousdale et al., 2012). These motifs might relate to measurable properties of the neocortex (Song et al., 2005).

Velocity estimation is a useful approximation to motion computation

In this paper, we evaluated the fly’s motion computation algorithm by measuring the accuracy of velocity estimation. Prior studies have often hypothesized that velocity estimation is a key requirement of motion processing and optomotor circuitry (Clark et al., 2014; Dror et al., 2001; Fitzgerald and Clark, 2015; Fitzgerald et al., 2011; Poggio and Reichardt, 1973; Potters and Bialek, 1994). It is thus reassuring that a model that better fit optomotor behavior also predicted image velocity more accurately. Nevertheless, motion computation is involved in perceptual tasks beyond the optomotor response, such as detecting looming stimuli (Card and Dickinson, 2008; Zacarias et al., 2018). In such tasks, the goal may not be to estimate the velocity of the visual object, but spatiotemporal correlations might nevertheless be useful (Nitzany and Victor, 2014). In addition, fly motion detecting neurons respond to static sinusoids or local luminance changes without obvious relevance for motion processing (Fisher et al., 2015; Gruntman et al., 2018; Salazar-Gatzimas et al., 2018), which suggests that motion computation algorithms might be jointly optimized alongside the detection of other visual features. Finally, visual systems have to coordinate with motor systems to achieve accurate sensorimotor transformations, so one should take the properties of the motor system into consideration when evaluating the performance of a motion computation algorithm (Dickinson et al., 2000). Future work could consider more sophisticated evaluation metrics that better reflect the total ethological relevance of the visual environment to the fly. As we continue to distill the factors that combine to set the ultimate performance criteria, the use of velocity estimation is likely to remain a simple, useful, and insightful approximation.

Potential mechanisms underlying the measured light-dark asymmetries

Visual systems in both vertebrates and invertebrates split into ON and OFF pathways that process light and dark signals separately and asymmetrically (Balasubramanian and Sterling, 2009; Chichilnisky and Kalmar, 2002; Clark and Demb, 2016; Leonhardt et al., 2016; Ratliff et al., 2010; Ravi et al., 2018; Sagdullaev and McCall, 2005; Salazar-Gatzimas et al., 2018). Some of these differences could result from biological constraints. Others could be an ethologically relevant adaptation to light-dark asymmetries found in the natural world. Either way, it is difficult to extrapolate from asymmetric neuronal processing of light and dark signals to functional asymmetries in downstream processing, including behavior. In this study, we used the behavioral turning responses to measure asymmetries in the flies’ motion computation algorithm, instead of examining ON and OFF processing channels at the neuronal level. Since optomotor sensitivity to triplet spatiotemporal correlations is necessarily a functional consequence of underlying asymmetric visual signal processing, we could thus directly link light-dark asymmetries in natural scenes to the functional impact of ON-OFF asymmetric neural circuitry. It is similarly important to identify additional light-dark asymmetric behaviors that can clarify the functional role of other light-dark asymmetries in visual processing.

Having established the functional relevance of ON-OFF asymmetric visual processing, it is next important to find its neural implementation. Previous work has suggested that front-end nonlinearities could account for certain optomotor illusions in flies (Bülthoff and Götz, 1979), and it is conceivable that such nonlinearities could generate contrast asymmetric motion responses (Clark and Demb, 2016; Fitzgerald et al., 2011). However, several simple front-end nonlinearities can improve motion estimation without inducing the observed triplet correlation responses (Fitzgerald and Clark, 2015). Alternatively, nonlinear processing at the level of direction-selective T4 and T5 neurons could also generate the asymmetries we observed here. Indeed, differentially affecting T4 and T5 activity, either through direct silencing or by manipulating upstream neurons, alters the behavioral responses of flies to triplet correlations (Clark et al., 2014; Leonhardt et al., 2016), and parallel experiments in humans similarly find that contrast-asymmetric responses are mediated by neurons separately modulated by moving ON and OFF edges (Clark et al., 2014). Yet, it remains unclear whether asymmetric responses of T4 and T5 are inherited from upstream neurons. For instance, contrast adaptation could differ between the two pathways (Chichilnisky and Kalmar, 2002), and incompletely rectified inputs to T4 and T5 could generate asymmetrical responses to light and dark inputs (Salazar-Gatzimas et al., 2018). The weightings of T4 and T5 signals in downstream circuits could also result in contrast asymmetric phenomena. This rich landscape of possibilities motivates us to think that multiple mechanisms are likely to be involved. By measuring behavior and distilling the abstract algorithmic properties of the system, we will be able to constrain the contributions of individual circuit components without confining ourselves to an overly narrow class of mechanistic models.

Relating algorithm and implementation in fly visual motion estimation

David Marr famously asserted that neural computation needs to be understood at both the algorithmic and implementational levels (Marr and Poggio, 1976). The benefits of this dual understanding go both ways. On one hand, the brain is immensely complicated, and an algorithmic theory can provide an invaluable lens for making sense of its details. On the other hand, the nuances of neuronal implementation can lead to new algorithmic questions and mechanistically satisfying answers. Marr used the optomotor response of flies to articulate his philosophy over forty years ago, and the community is still leveraging this problem to unravel the subtle relationships between algorithm and mechanism in the brain. The HRC model provided the first algorithmic theory of fly visual motion estimation, and this model’s insights into the roles of spatial separation, differential time delays, and nonlinear signal integration have now been verified mechanistically (Arenz et al., 2017; Fisher et al., 2015; Haag et al., 2016; Leong et al., 2016; Salazar-Gatzimas et al., 2016; Takemura et al., 2017). They still provide the bedrock of our understanding. Yet the precise mathematical form and mechanistic origin of the nonlinearity remain controversial, with different papers pointing out compelling roles for membrane voltages, intracellular calcium signals, and ON-OFF pathways (Badwan et al., 2019; Gruntman et al., 2018; Haag et al., 2016; Leong et al., 2016; Leonhardt et al., 2016; Salazar-Gatzimas et al., 2018; Wienecke et al., 2018). None of this complexity invalidates the core insights of the HRC, nor does the HRC’s domain of success warrant apathy toward the fundamental importance of these unexpected findings. Instead of algorithm and mechanism providing parallel or hierarchical goals, they should be treated as parts of one integrated understanding of the circuit.

Materials and methods

Fly husbandry

Request a detailed protocol

Flies were grown at 20°C, 50% humidity in 12-hr day/night cycles on a dextrose-based food. Flies used for the behavioral experiment were non-virgin wildtype (D. melanogaster: WT: +; +; +) females between 24 and 72 hr old.

Psychophysics

Request a detailed protocol

The fly’s turning behavior was measured with a fly-on-a-ball rig, as described in previous studies (Clark et al., 2011; Creamer et al., 2018). The fly was tethered above a ball floating on a cushion of air. The ball served as a treadmill, so that the fly could walk and turn while its position and orientation were fixed. The rotational response of the fly was the averaged rotation magnitude of the ball in 1/60 s bins, with an angular resolution of ~0.5°. Panoramic screens surrounded the fly, covering 270° horizontally and 106° vertically (Creamer et al., 2019). A Lightcrafter DLP (Texas Instruments, USA) projected the visual stimuli onto the screens in green light (peak wavelength 520 nm, mean intensity 100 cd/m²). The spatial resolution of the projector was around 0.3°, and the projector image was updated at 180 Hz. The rig’s temperature was 34–36°C.

Visual stimuli

Request a detailed protocol

Visual stimuli varied along the horizontal axis in 5° pixels and were uniform along the vertical dimension. Since the panoramic screen was 270° wide, the horizontal axis was divided into 270/5=54 pixels, so the screen was divided into 54 vertical bars.

We used two types of binary stochastic visual stimuli for kernel extraction: a three-bar-block stimulus and a four-bar-block stimulus. In the 3 (4)-bar-block stimuli, each block contained 3 (4) neighboring vertical bars that flickered white or black independently in space and time. Identical blocks then repeated periodically around the fly. Since there are 54 bars, the entire visual field was divided into 18 (13.5) blocks. Each bar updated its contrast every 1/60 s.

We used third-order glider stimuli (Hu and Victor, 2010) to directly measure the fly’s sensitivity to three-point correlations. Third-order glider stimuli are binary patterns of black and white pixels. In each glider stimulus, one can enforce a three-point spatiotemporal correlation. Here, we considered only three-point spatiotemporal correlations that involved two neighboring points in space. We described the specific configuration of each glider with a four-parameter scheme, (Δτ31, Δτ21, L/R, P). We defined the 1st point to be the more recent one of the two points sharing a spatial location, while the other point at this spatial location was defined to be the 3rd point. The final point, which was in a position adjacent to the 1st and 3rd points, was defined to be the 2nd point. The temporal interval between the 2nd point and the 1st point was denoted Δτ21, and Δτ31 was defined similarly. For example, Δτ31 = 1 means that the 3rd point is 1 frame (16.7 ms) before the 1st point. Although Δτ31 is positive by definition, Δτ21 can be positive or negative. We used L and R to indicate whether the 1st and the 3rd points are on the left or right of the 2nd point. As detailed previously (Clark et al., 2014; Hu and Victor, 2010), in positive parity gliders (P = +1) one or three of these three points are white, whereas in negative parity gliders (P = −1) one or three of the points are black. We illustrated the configuration of each glider using a 'ball-stick' diagram, where the x-axis represents space, the y-axis represents time, time runs downward, and the plus (minus) sign denotes the polarity of the glider (Figure 4—figure supplement 1A). Overall, we presented 52 = 13 × 2 × 2 different stimuli: thirteen different temporal-interval configurations, each with two directions and two polarities (Table 1).
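As a concrete illustration of this construction, the sketch below generates a binary three-point glider in which every triplet — two points at one spatial location separated by Δτ31 frames, plus one point at the adjacent location offset by Δτ21 frames — multiplies to the chosen parity, using the coding white = +1 and black = −1 (so a product of +1 corresponds to one or three white points). This is a minimal sketch restricted to Δτ21, Δτ31 ≥ 1; the function name and arguments are illustrative, not from the paper.

```python
import numpy as np

def make_glider(n_t, n_x, dt31, dt21, parity, seed=0):
    """Binary (+1 = white, -1 = black) three-point glider stimulus.
    Wherever the triplet fits, the product
        s[t, x] * s[t - dt21, x + 1] * s[t - dt31, x]
    equals `parity`; pixels not fixed by the rule stay random.
    Restricted to dt21, dt31 >= 1 so each triplet is filled causally."""
    rng = np.random.default_rng(seed)
    s = rng.choice([-1, 1], size=(n_t, n_x))
    for t in range(max(dt31, dt21), n_t):
        for x in range(n_x - 1):
            # set the most recent (apex) point from the two earlier ones
            s[t, x] = parity * s[t - dt21, x + 1] * s[t - dt31, x]
    return s
```

Because each apex point is determined by two strictly earlier points, filling the pattern in time order enforces the correlation everywhere the triplet fits.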

HRC model

Request a detailed protocol

We constructed a classical Hassenstein-Reichardt correlator (HRC) model (Hassenstein and Reichardt, 1956). The output rHRC(t) was defined as

rHRC(t) = [s1 ∗ fHRC](t) [s2 ∗ gHRC](t) − [s2 ∗ fHRC](t) [s1 ∗ gHRC](t),

where s1(t) and s2(t) denote the contrast signals from two spatial locations, * denotes convolution in time, and fHRC(t), gHRC(t) are the temporal filters of the delay line and the non-delay line. In particular,

fHRC(t) = t exp(−t/τHRC)

for t ≥ 0, with fHRC(t) = 0 for t < 0, and

gHRC(t) = d fHRC(t) / dt,

where τHRC=20 ms.
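A minimal numerical sketch may make the definition concrete. The function below implements the two filtered-and-multiplied arms with discrete causal convolutions; the function name, filter length, and sampling rate are illustrative choices, not from the paper.

```python
import numpy as np

def hrc_response(s1, s2, dt=1/60, tau=0.02, n_taps=60):
    """Classical HRC output for two contrast traces s1(t), s2(t).
    Delay-line filter f(t) = t*exp(-t/tau); non-delay line g = df/dt."""
    t = np.arange(n_taps) * dt
    f = t * np.exp(-t / tau)
    g = np.gradient(f, dt)                      # discrete derivative of f
    causal = lambda s, h: np.convolve(s, h, mode='full')[:len(s)] * dt
    return causal(s1, f) * causal(s2, g) - causal(s2, f) * causal(s1, g)
```

Swapping the two inputs flips the sign of the output, consistent with the model's mirror anti-symmetry, and identical inputs at the two locations produce exactly zero output.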

Modeling the fly’s motion computation algorithm with Volterra kernels

Request a detailed protocol

We approximated the fly’s motion computation algorithm with second- and third-order Volterra kernels. We provide a detailed description of this model in Section 1 of Appendix 1. In brief, we discretize space into pixels with Δx=5° resolution, discretize time into time bins with Δt=1/60 s resolution, and index locations in space with an integer subscript ξ. We modeled the response of the fly r(t) as the sum of an array of elementary motion detectors (EMDs) acting at each position in space:

r(t) = Σξ rξ(t),

where rξ(t) denotes the response of the EMD at spatial location ξ. That term is itself the sum of the second-order response rξ(2)(t) and the third-order response rξ(3)(t):

rξ(t) = rξ(2)(t) + rξ(3)(t).

The second- and third-order responses are defined as follows (see also Appendix 1):

rξ(2)(t) = 2 Στ1,τ2 KLR(2)(τ1,τ2) sξ(t−τ1) sξ+1(t−τ2) (Δt)²,
rξ(3)(t) = 3 Στ1,τ2,τ3 KLRL(3)(τ1,τ2,τ3) [sξ(t−τ1) sξ+1(t−τ2) sξ(t−τ3) − sξ+1(t−τ1) sξ(t−τ2) sξ+1(t−τ3)] (Δt)³,

where sξ(t) and sξ+1(t) denote the visual inputs that the EMD at spatial location ξ receives. (In Figure 2B and Figure 2—figure supplement 1, we used sL(t) and sR(t) to represent sξ(t) and sξ+1(t).) The second-order response rξ(2)(t) is the sum of second-order features, sξ(t−τ1) sξ+1(t−τ2), weighted by the second-order kernel, KLR(2)(τ1,τ2). The third-order response, rξ(3)(t), is the sum of third-order features, sξ(t−τ1) sξ+1(t−τ2) sξ(t−τ3) and sξ+1(t−τ1) sξ(t−τ2) sξ+1(t−τ3), weighted by the third-order kernel, KLRL(3)(τ1,τ2,τ3). Note that we use the notation KLLR(3) in Appendix 1, which can be transformed into KLRL(3) by interchanging the second and third spatial and temporal arguments. In particular, KLRL(3)(τ1,τ2,τ3) = KLLR(3)(τ1,τ3,τ2).
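The sums defining the two responses can be evaluated directly. The sketch below computes the combined second- plus third-order response at a single time point from short input histories at the two locations (illustrative names; index 0 of each history is the most recent sample):

```python
import numpy as np

def volterra_response(K2, K3, s_l, s_r, dt=1/60):
    """Second- plus third-order Volterra response at one time point.
    K2 has shape (T, T); K3 has shape (T, T, T); s_l, s_r are the
    left/right input histories (most recent sample first)."""
    r2 = 2 * np.einsum('ab,a,b->', K2, s_l, s_r) * dt**2
    r3 = 3 * (np.einsum('abc,a,b,c->', K3, s_l, s_r, s_l)
              - np.einsum('abc,a,b,c->', K3, s_r, s_l, s_r)) * dt**3
    return r2 + r3
```

The two `einsum` contractions are exactly the weighted sums over the second- and third-order features, including the mirrored third-order term with the opposite sign.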

Measuring Volterra kernels with stochastic stimuli and reverse-correlation

Request a detailed protocol

To estimate the kernels, we presented stochastic binary stimuli to flies and reverse-correlated the corresponding response with the input. In particular, we estimated the second-order kernel by reverse-correlating the mean-subtracted turning response with the products of two points in space and time (Appendix 1),

K^LR(2)(τ1,τ2) = (1/2) (K^LR3(2)(τ1,τ2) + K^LR4(2)(τ1,τ2)),

where K^LR3(2)(τ1,τ2) is the estimated second-order kernel from the three-bar-block stimulus and K^LR4(2)(τ1,τ2) is the estimated second-order kernel from the four-bar-block stimulus. In particular,

K^LR3(4)(2)(τ1,τ2) = (1/54) (1/2) (1/(γstim⁴ (Δt)²)) Σξ=1,2,3,(4) (1/T) Σt rturn(t) sξ(t−τ1) sξ+1(t−τ2),

where rturn(t) is the mean-subtracted turning response, and γstim is the magnitude of the contrast in the binary stimulus (Appendix 1). s1(t), s2(t), s3(t) represent the contrasts of the 3 independent bars in the three-bar-block stimulus, with the periodic boundary condition sξ+1(t−τ2) = s1(t−τ2) for ξ = 3. Similarly, s1(t), s2(t), s3(t), s4(t) represent the contrasts of the 4 independent bars in the four-bar-block stimulus, with sξ+1(t−τ2) = s1(t−τ2) for ξ = 4. We presented the three-bar-block stimulus to 35 flies and the four-bar-block stimulus to 37 flies, for T = 20 min each. All fly kernel estimates were averaged to generate the final kernel estimate. Note that we enforce K^LR(2)(τ1,τ2) = 0 when τ1 = τ2, because we model the visual motion estimator as a mirror-antisymmetric operator (Appendix 1).
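The logic of this estimator can be checked in a toy simulation: for independent binary inputs, reverse-correlating a purely second-order response against products of two input points recovers the generating kernel. The sketch below uses a single detector pair with γstim = 1 and Δt = 1 for simplicity; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 200_000, 4                       # time samples, kernel taps
K_true = rng.normal(size=(n, n))

s1 = rng.choice([-1.0, 1.0], size=T)    # binary contrasts, gamma = 1
s2 = rng.choice([-1.0, 1.0], size=T)

# input histories: S1[a, t] = s1(t - a) (circular wrap, negligible for large T)
S1 = np.stack([np.roll(s1, a) for a in range(n)])
S2 = np.stack([np.roll(s2, b) for b in range(n)])

# second-order model response: r(t) = sum_{a,b} K(a,b) s1(t-a) s2(t-b)
r = np.einsum('ab,at,bt->t', K_true, S1, S2)

# reverse correlation: K_hat(a,b) = <r(t) s1(t-a) s2(t-b)>
K_hat = np.einsum('t,at,bt->ab', r, S1, S2) / T
```

Because the binary inputs are independent across bars and time bins, all cross terms average to zero and `K_hat` converges to `K_true` as T grows.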

Similarly, we estimated the third-order kernel by reverse-correlating the mean-subtracted turning response with the products of three points in space and time (Appendix 1).

K^LRL(3)(τ1,τ2,τ3) = (1/2) (K^LRL3(3)(τ1,τ2,τ3) + K^LRL4(3)(τ1,τ2,τ3)),
K^RLR(3)(τ1,τ2,τ3) = (1/2) (K^RLR3(3)(τ1,τ2,τ3) + K^RLR4(3)(τ1,τ2,τ3)),

where

K^LRL3(4)(3)(τ1,τ2,τ3) = (1/54) (1/6) (1/(γstim⁶ (Δt)³)) Σξ=1,2,3,(4) (1/T) Σt rturn(t) sξ(t−τ1) sξ+1(t−τ2) sξ(t−τ3),
K^RLR3(4)(3)(τ1,τ2,τ3) = (1/54) (1/6) (1/(γstim⁶ (Δt)³)) Σξ=1,2,3,(4) (1/T) Σt rturn(t) sξ+1(t−τ1) sξ(t−τ2) sξ+1(t−τ3),

and τ1 ≤ τ3.

We then enforced mirror anti-symmetry by,

K^LRsym(2)(τ1,τ2) = (1/2) (K^LR(2)(τ1,τ2) − K^LR(2)(τ2,τ1)),
K^LRLsym(3)(τ1,τ2,τ3) = (1/2) (K^LRL(3)(τ1,τ2,τ3) − K^RLR(3)(τ1,τ2,τ3)),
K^RLRsym(3)(τ1,τ2,τ3) = −K^LRLsym(3)(τ1,τ2,τ3),

where K^LRsym(2)(τ1,τ2), K^LRLsym(3)(τ1,τ2,τ3) and K^RLRsym(3)(τ1,τ2,τ3) are the symmetrized kernels, and we refer to them as the mirror anti-symmetric component.

We next evaluated how much variance in the fly’s turning behavior can be explained by the estimated second- and third-order kernels. We presented the same stochastic stimulus sequence to many flies, so we averaged the turning response from different flies, denoted as r¯turn(t), to estimate the true stimulus-associated turning response rstim-driven(t). We predicted the turning response rpred(t) to the same stimulus sequence,

rpred(t)=rpred(2)(t)+rpred(3)(t),

where rpred(2)(t) (rpred(3)(t)) is the predicted response from the second-order (third-order) kernel. We calculated rpred(t) using only the anti-symmetric components of the kernels, that is, K^LRsym(2)(τ1,τ2) and K^LRLsym(3)(τ1,τ2,τ3). The Pearson correlations between rpred(t) and r¯turn(t) were 0.686 and 0.787 in the three-bar and four-bar experiments, respectively.

If fly turning responses were driven only by visual stimuli, were mirror anti-symmetric, and used only second- and third-order correlations, then the measured second- and third-order kernels would explain all the variance in the stimulus-driven turning responses, and the Pearson correlation between rpred(t) and rstim-driven(t) would be one. However, our measured kernels explained only about half of the variance. There are several potential reasons for this. First, the turning responses of flies were very noisy, making it difficult to estimate the true stimulus-driven response; that is, r¯turn(t) was a poor estimate of rstim-driven(t). Second, we wanted to minimally extend the canonical second-order motion detector while accounting for light-dark asymmetric visual processing, so we added only one more term, the third-order kernel. The fly might respond to higher-order spatiotemporal correlations in its visual inputs that our model did not capture.

Representing the second- and third-order kernels in the impulse response format

Request a detailed protocol

To better understand and visualize the extracted kernels, we rearranged the elements in the kernels such that we could interpret kernels as the impulse response to a pair (triplet) of contrasts. This is analogous to the impulse response to a single contrast change at one point in a linear system.

Before rearrangement, the rows (columns) of the second-order kernel represent the temporal argument τ1 (τ2) in the matrix K^LRsym(2)(τ1,τ2). After rearrangement, the rows correspond to the time since the more recent point, and the columns represent different temporal intervals between the two points, with negative intervals meaning that the right point is more recent than the left point (Figure 4—figure supplement 2D) (Salazar-Gatzimas et al., 2016). We denote this new format as

K^impulse(2)(τ,Δτ21) ≡ K^LRsym(2)(τ, τ+Δτ21),

where Δτ21 = τ2 − τ1 and τ = min(τ1, τ1+Δτ21). Because we have enforced mirror anti-symmetry in K^LRsym(2), the columns of K^impulse(2)(τ,Δτ21) are anti-symmetric around Δτ21 = 0. We interpreted the columns of K^impulse(2)(τ,Δτ21) as the impulse response of the fly to a pair of adjacent contrast changes separated by Δτ21 in time.
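The rearrangement can be sketched as follows; for a mirror anti-symmetric input kernel, the resulting columns are anti-symmetric around Δτ21 = 0, as stated above. Names are illustrative, and entries that do not correspond to any (τ1, τ2) pair are marked NaN.

```python
import numpy as np

def to_impulse_format(K2):
    """Rearrange K2[tau1, tau2] into K_imp[tau, dtau21], where
    dtau21 = tau2 - tau1 and tau is the delay of the more recent point.
    Column index T - 1 corresponds to dtau21 = 0."""
    T = K2.shape[0]
    K_imp = np.full((T, 2 * T - 1), np.nan)
    for t1 in range(T):
        for t2 in range(T):
            tau, d = min(t1, t2), t2 - t1
            K_imp[tau, d + T - 1] = K2[t1, t2]
    return K_imp
```

Each (τ1, τ2) pair maps to a unique (τ, Δτ21) slot, so the rearrangement loses no information.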

Similarly, before rearrangement, the three dimensions of the third-order kernel represent the three temporal arguments of K^LRLsym(3)(τ1,τ2,τ3). Once rearranged, we define

K^impulse(3)(τ,Δτ21,Δτ31) ≡ K^LRLsym(3)(τ, τ+Δτ21, τ+Δτ31),

where τ=min(τ1,τ1+Δτ21,τ1+Δτ31), Δτ21=τ2-τ1, and Δτ31=τ3-τ1. Rows again represent the time since the last point, the columns represent the temporal distance between the more recent point on the left and the sole right point, and the third tensor dimension represents the temporal distance between two left points (Figure 4B). For this third-order kernel, we also summed along the rows for 0.75 s to define the summed kernel strength (Figure 4C),

K^summed(3)(Δτ21,Δτ31) = Στ<0.75 s K^impulse(3)(τ,Δτ21,Δτ31).

Testing the significance of the measured third-order kernel with ‘null kernels’

Request a detailed protocol

We tested the significance of the measured kernel with synthetic null kernels (Figure 4C). We shifted the stimulus by 100 random temporal offsets (each offset was at least 2 s), reverse-correlated these shifted stimuli with the responses, and thereby generated 100 synthetic null kernels. These 100 kernels, extracted from misaligned stimulus and response, were used to test the significance of the real kernel. We calculated the summed kernel strength of each null kernel, built the null distribution of summed kernel strengths, and performed a two-tailed z-test. We tested kernel strength over the region of the kernel with τ3−τ1 from 0 to 250 ms and τ2−τ1 from −250 to 250 ms, which amounted to 528 = 16 × 33 kernel strengths in total. Of these, 43 were significant (p<0.05), and around 23% of the significant points (10 of 43) aggregated in the region |τ1−τ2| < 83 ms and |τ3−τ1| < 83 ms. Therefore, we further simplified our kernel by setting third-order kernel elements to zero when |τ1−τ2| ≥ 83 ms or |τ3−τ1| ≥ 83 ms, and denoted the 'cleaned' kernel K^LRLsymclean(3)(τ1,τ2,τ3). To be consistent, we also set elements of the second-order kernel to zero when |τ1−τ2| > 83 ms, and denoted it K^LRsymclean(2)(τ1,τ2).
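The shuffling logic can be illustrated on a single kernel element: circularly shifting the response relative to the stimulus destroys their alignment, so the same reverse-correlation statistic computed on shifted data yields a null distribution centered near zero. This is a toy sketch; the function name, the single-lag statistic, and the z-score summary are illustrative.

```python
import numpy as np

def null_distribution(r, s1, s2, lag=(1, 0), n_null=100, min_shift=120):
    """One reverse-correlation statistic at a fixed (tau1, tau2) lag, plus
    a null distribution from circular shifts of the response in time."""
    a, b = lag
    stat = lambda resp: np.mean(resp * np.roll(s1, a) * np.roll(s2, b))
    true_stat = stat(r)
    rng = np.random.default_rng(0)
    shifts = rng.integers(min_shift, len(r) - min_shift, size=n_null)
    null = np.array([stat(np.roll(r, k)) for k in shifts])
    z = (true_stat - null.mean()) / null.std()
    return true_stat, null, z
```

When the response is genuinely driven by the tested correlation, the aligned statistic stands far outside the null distribution.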

The exact p-values for the summed third-order kernel strength are shown in the following table for τ2−τ1 from −133 ms to 133 ms (top to bottom) in 16.7 ms increments.

τ2−τ1 (ms) | Δτ31 = 16.7 ms | Δτ31 = 33.3 ms | Δτ31 = 50 ms
−133.3 | 0.4328 | 0.7506 | 0.3948
−116.7 | 0.9784 | 0.0514 | 0.1870
−100.0 | 0.7381 | 0.0587 | 0.0904
−83.3 | 0.3340 | 0.9650 | 0.5661
−66.7 | 0.4435 | 0.6801 | 0.6273
−50.0 | 0.2154 | 0.8081 | 0.8081
−33.3 | 0.0645 | 0.0942 | 0.0641
−16.7 | 0.0008 | 0.7077 | 0.0015
0.0 | <0.0001 | 0.3014 | 0.9246
16.7 | <0.0001 | 0.0823 | 0.9375
33.3 | 0.8873 | <0.0001 | 0.0149
50.0 | 0.0177 | 0.3112 | 0.0001
66.7 | 0.1011 | 0.0487 | 0.2435
83.3 | 0.2565 | 0.3086 | 0.5983
100.0 | 0.6328 | 0.5057 | 0.7153
116.7 | 0.1037 | 0.4815 | 0.3233
133.3 | 0.3677 | 0.2075 | 0.6670

Comparing the measured third-order kernel with the glider responses

Request a detailed protocol

We measured the fly’s sensitivity to three-point correlations using a suite of third-order glider stimuli. Overall, we presented 52 = 13 × 2 × 2 different stimuli (see Visual stimuli in Materials and methods, Table 1, Figure 4—figure supplement 1A). Each glider stimulus elicited sustained turning responses (Clark et al., 2014), so we averaged the response over time and denoted it r(Δτ31,Δτ21,L/R,P)glider, where the subscript specifies the stimulus type. Since we assume the fly’s motion computation is mirror anti-symmetric, we subtracted the responses to pairs of gliders with opposite directions but the same temporal intervals and polarity, and denoted the result r(Δτ31,Δτ21,P)glider,

r(Δτ31,Δτ21,P)glider = (1/2) (r(Δτ31,Δτ21,O,P)glider − r(Δτ31,Δτ21,Ō,P)glider),

where O denotes the orientation of the glider (i.e. left or right), and Ō denotes the opposite orientation. We plotted 18 out of 26 averaged responses in Figure 4—figure supplement 1B.

Table 1
Statistics of responses to third-order glider stimuli with different spatiotemporal structures.
Index | Δτ31 (frames of 16.7 ms) | Δτ21 (frames of 16.7 ms) | (nP=+1, nP=−1) | (pP=+1, pP=−1)
1 | 1 | 0 | (18, 12) | (0.0003, <0.0001)
2 | 2 | 0 | (35, 29) | (0.0299, 0.0003)
3 | 3 | 0 | (14, 8) | (0.4218, 0.0092)
4 | 4 | 0 | (14, 8) | (0.0201, 0.3552)
5 | 1 | 1 | (18, 13) | (<0.0001, <0.0001)
6 | 2 | 2 | (35, 30) | (0.0026, <0.0001)
7 | 3 | 3 | (14, 9) | (0.2323, 0.0875)
8 | 4 | 4 | (14, 9) | (0.6778, 0.1700)
9 | 1 | −1 | (8, 8) | (<0.0001, 0.0044)
10 | 2 | 1 | (8, 8) | (0.0041, 0.0617)
11 | 1 | 2 | (8, 8) | (0.0044, 0.6713)
12 | 3 | 1 | (21, 21) | (0.5396, 0.4470)
13 | 3 | 2 | (21, 21) | (0.1042, 0.0203)

In Table 1, we list the number of flies tested with each glider (nP=+1 is the number of flies tested with positive gliders, nP=−1 with negative gliders) and the p-values of Student's t-tests against zero response (pP=+1 is the significance level for positive gliders, pP=−1 for negative gliders).

The measured third-order kernel and the measured glider responses should both reflect the fly’s sensitivity to three-point correlations. To test agreement between these two measurements, we used the measured third-order kernel to predict the fly’s responses to glider stimuli. We made the prediction by summing the ‘diagonal line’ of the third-order kernel. Specifically, we found the predicted response to specific third-order gliders by summing over all elements in the kernel with the same temporal differences as the glider:

r(Δτ31,Δτ21)(3)-pred = 54 × 6 × γstim⁶ Στ KLRLsym(3)(τ, τ+Δτ21, τ+Δτ31) (Δt)³.

The constant 54 × 6 accounts for spatial summation over all 54 putative EMDs and for the six parts of the third-order kernel in one EMD (Appendix 1), and γstim = 1 is the contrast of the glider stimuli. For gliders that have two points on the right side, we used KRLRsym instead of KLRLsym. Since the third-order kernel is agnostic to the polarity of the three-point correlations and reflects only the average of the fly’s sensitivity to positive and negative correlations, we averaged the responses to positive and negative gliders. This neglects higher-order components of glider responses that could nevertheless be biologically meaningful.

r(Δτ31,Δτ21)glider-ave = (1/2) (r(Δτ31,Δτ21,+1)glider − r(Δτ31,Δτ21,−1)glider).

We then compared the r(Δτ31,Δτ21)glider-ave with r(Δτ31,Δτ21)(3)-pred (Figure 4D).
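The diagonal summation used for the prediction can be sketched directly; the factor 54 × 6 and γstim = 1 follow the text, while the function name and arguments are illustrative.

```python
import numpy as np

def predict_glider(K3, d21, d31, gamma=1.0, dt=1/60, n_emd=54):
    """Predicted glider response: sum the third-order kernel along the
    diagonal with fixed temporal offsets (tau, tau + d21, tau + d31)."""
    T = K3.shape[0]
    diag = sum(K3[t, t + d21, t + d31] for t in range(T)
               if 0 <= t + d21 < T and 0 <= t + d31 < T)
    return n_emd * 6 * gamma**6 * diag * dt**3
```

Only kernel elements whose temporal differences match the glider's (Δτ21, Δτ31) contribute, which is why the prediction reduces to a sum along one diagonal of the kernel tensor.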

Comparing the temporal structure of the second- and third-order kernels

Request a detailed protocol

To compare the temporal structure of the two kernels, we first rearranged and combined the elements of the third-order kernel into a two-dimensional representation, and then rearranged the second-order kernel into the impulse response format. Specifically, we rearranged the elements of the third-order kernel into K^alignedLRL(3)(τ, ΔτLR, ΔτLL), where τ corresponds to the time since the most recent point, ΔτLR is the average time difference between the left points and the right point, and 2ΔτLL is the temporal separation between the two left points (Figure 4—figure supplement 2A left). In particular,

K^alignedLRL(3)(τ, ΔτLR, ΔτLL) ≡ K^LRLsym(3)(τ2+ΔτLR−ΔτLL, τ2, τ2+ΔτLR+ΔτLL),

where τ = min({τ2+ΔτLR−ΔτLL, τ2, τ2+ΔτLR+ΔτLL}). We similarly defined

K^alignedRLR(3)(τ, ΔτRL, ΔτRR) ≡ K^RLRsym(3)(τ2+ΔτRL−ΔτRR, τ2, τ2+ΔτRL+ΔτRR),

where τ corresponds to the time since the most recent point, ΔτRL is the average time difference between right points and the left point, and 2ΔτRR is the temporal separation between the two right points (Figure 4—figure supplement 2A right). Finally, we summed over the within-point time differences (ΔτRR and ΔτLL), and summed these two pieces to obtain a matrix (Figure 4—figure supplement 2B),

K^align2D(3)(τ, ΔτRL) = Σ2ΔτLL<9Δt K^alignedLRL(3)(τ, ΔτRL, ΔτLL) + Σ2ΔτRR<9Δt K^alignedRLR(3)(τ, ΔτRL, ΔτRR),

where Δt=1/60 s.

K^align2D(3)(τ, ΔτRL) has rows and columns that are conceptually comparable to those of K^impulse(2)(τ,Δτ21), as rows represent times since the most recent point and columns describe the temporal distance between right and left points. However, in K^impulse(2) the columns are spaced by 16.67 ms (Figure 4—figure supplement 2D), whereas in K^align2D(3) the columns are spaced by 8.33 ms (Figure 4—figure supplement 2B). This results from the fact that Δτ21 is an integer number of frames in K^impulse(2), whereas 2ΔτRL is an integer number of frames in K^align2D(3). We thus averaged neighboring elements in K^align2D(3) (Figure 4—figure supplement 2C), so that it has the same resolution as K^impulse(2).

We then summed both K^align2D(3) and K^impulse(2) in each column, and we rescaled the two summed kernels so that the norm of each was 1 (Figure 4E). In order to ease the visual comparison of the temporal structure between the two kernels, we also flipped the sign of the summed K^align2D(3).

Comparing the second- and third-order kernels with the singular value decomposition (SVD)

Request a detailed protocol

We factorized K^align2D(3) and K^impulse(2) into the products of a set of basis vectors with SVD (Figure 4—figure supplement 2EFG),

K^impulse(2) = U(2) Σ(2) V(2)T,
K^align2D(3) = U(3) Σ(3) V(3)T,

where U(i), Σ(i), and V(i) are the left-singular vectors, singular values, and right-singular vectors of the associated ith-order kernel. We use u1(2), v1(2) (u1(3), v1(3)) to denote the left and right singular vectors corresponding to the largest singular values. For visualization purposes, in Figure 4—figure supplement 2G, we flipped the sign of v1(3) so that readers could visually compare the temporal structures of these two vectors.
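As a reminder of what the factorization provides, the sketch below extracts the dominant singular vectors of a small stand-in matrix with np.linalg.svd; for a rank-1 matrix, the reconstruction from the first singular triple is exact (names illustrative).

```python
import numpy as np

# a small rank-1 "kernel" standing in for the rearranged kernels
K = np.outer([1.0, 2.0, 3.0], [0.5, -1.0, 2.0, 0.0])
U, S, Vt = np.linalg.svd(K, full_matrices=False)
u1, v1 = U[:, 0], Vt[0]              # dominant left/right singular vectors
K1 = S[0] * np.outer(u1, v1)         # rank-1 reconstruction
```

For the measured kernels, u1 and v1 summarize the dominant temporal structure along the rows and columns, which is what the figure compares between the second- and third-order kernels.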

Extracting kernels of various motion detectors that were optimized to predict image velocity in natural scenes

Request a detailed protocol

We characterized four other motion detectors (Figure 4—figure supplement 3) with Volterra kernels (Fitzgerald and Clark, 2015). These motion detectors have various physiologically plausible structures and were optimized to predict image velocities in natural scenes. We fed the same stochastic binary stimulus sequences to these motion detectors, collected the corresponding responses, and extracted the second- and third-order kernels using reverse-correlation. In Figure 4F, we present the summed kernel strengths of the third-order kernels. Note that we present only several examples, and the spatiotemporal arguments of these examples are represented graphically.

Natural scene dataset

Request a detailed protocol

We used a natural scene dataset (Meyer et al., 2014), which contains 421 panoramic luminance-calibrated naturalistic two-dimensional pictures. Each picture has 927×251 pixels and subtends 360° horizontally and 97.4° vertically, so that the spatial resolution is ~0.30°/pixel. In our study, we used 1-dimensional images, which were single rows from the two-dimensional pictures. Therefore, there were 105671=421×251 images in the dataset. We refer to the two-dimensional scenes as pictures or photographs, and refer to the one-dimensional slices as images.

Preprocessing photographs

Request a detailed protocol

To simulate the photoreceptors, we converted the luminance pictures into contrast pictures with a blurring step and a contrast computation step.

First, to simulate the spatial resolution of the fly’s ommatidia, we blurred the original photograph (Figure 4—figure supplement 1B), denoted by I, with a two-dimensional Gaussian filter fblur(x,y).

Iblur(x,y) = (I ⋆ fblur)(x,y) = Σu,v I(x+u, y+v) fblur(u,v),
fblur(u,v) = (1/(2πλblur²)) exp(−(u²+v²)/(2λblur²)), u− ≤ u ≤ u+, v− ≤ v ≤ v+,

where ⋆ denotes cross-correlation. The filter extends to ±3λblur, that is, u+ = v+ = |u−| = |v−| = 3λblur, where λblur is related to the full-width-at-half-maximum (FWHM) by λblur = FWHMblur/(2√(2 ln 2)), and we chose FWHMblur = 5.3° (Stavenga, 2003). The original pictures cover the full circular range horizontally, but only 97.4° vertically. When the range of the filter extended beyond the vertical boundary of the picture, we padded the picture by 'vertical reflection'. This reflection padding was also used when we calculated the local mean luminance. In Figure 1B, where we show one example picture, we did not perform this blurring step, in order to preserve high spatial acuity for human viewing.

Second, we converted the luminance signals in the blurred photograph to contrast signals (Figure 1—figure supplement 1CD) (Fitzgerald and Clark, 2015),

c(x,y) = (Iblur(x,y) − Imean(x,y)) / Imean(x,y),
Imean(x,y) = (Iblur ⋆ flocal-mean)(x,y),

where c(x,y) is the contrast at each location (x,y). Imean is the local mean luminance, that is, the luminance averaged with a two-dimensional Gaussian spatial filter flocal-mean. The length scale of flocal-mean can be equivalently described by λlocal-mean or FWHMlocal-mean, where λlocal-mean = FWHMlocal-mean/(2√(2 ln 2)). We swept FWHMlocal-mean from 10° to 75° (Figure 3—figure supplement 2ABC).
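The two preprocessing steps can be sketched together for a single image row. The FWHM-to-λ conversion, the ±3λ filter extent, and the normalization follow the text; the 1D simplification and the use of reflection padding everywhere are assumptions of this sketch, and all names are illustrative.

```python
import numpy as np

def gaussian_filter_1d(fwhm_deg, res_deg=0.3):
    """Normalized Gaussian filter with extent +/- 3*lambda,
    where lambda = FWHM / (2*sqrt(2*ln 2))."""
    lam = fwhm_deg / (2 * np.sqrt(2 * np.log(2)))
    half = int(np.ceil(3 * lam / res_deg))
    x = np.arange(-half, half + 1) * res_deg
    f = np.exp(-x**2 / (2 * lam**2))
    return f / f.sum()

def to_contrast(lum, fwhm_blur=5.3, fwhm_mean=25.0):
    """Luminance -> contrast for one 1D image row: blur to ommatidial
    resolution, then normalize by a local mean luminance."""
    def smooth(img, fwhm):
        f = gaussian_filter_1d(fwhm)
        h = len(f) // 2
        return np.convolve(np.pad(img, h, mode='reflect'), f, mode='valid')
    blurred = smooth(lum, fwhm_blur)
    local_mean = smooth(blurred, fwhm_mean)
    return (blurred - local_mean) / local_mean
```

Because both filters are normalized to sum to one, a uniform luminance field maps to zero contrast everywhere, as the definition requires.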

Alternatively, we also computed the local mean luminance over time instead of over space in Figure 3—figure supplement 2FG,

Imean(x,y,t) = Σu Iblur-time(x,y,u) flocal-mean-time(t−u),

where Iblur-time(x,y,t) is the simulated naturalistic moving luminance signal (Fitzgerald and Clark, 2015), and the local mean luminance is the luminance averaged over time with the temporal filter flocal-mean-time(t) = exp(−t/τlocal-mean), normalized to have a sum of 1. We swept τlocal-mean from 10 ms to 500 ms (Figure 3—figure supplement 2FG).

Depending on the parameters in local mean computation, we had 20 natural scene datasets, including 14 datasets whose local mean luminance was computed statically and 6 dynamically. Unless specified, we used the natural scene dataset with contrast images preprocessed statically with FWHMlocal-mean=25°.

Eliminating the higher order structure of natural scene ensemble

Request a detailed protocol

We created a synthetic image dataset in which we preserved only the second-order structure of the natural scene ensemble and eliminated the higher-order structure. We viewed each 1D image as a random vector determined by Pnatural(X), where X is an n-dimensional vector (n = 927 for 927 pixels). The covariance matrix of Pnatural(X) determines the point variance and pairwise spatial correlations of natural scenes. We constructed a Gaussian distribution Psynthetic(X) whose covariance matrix is the same as that of Pnatural(X). In this way, the image ensemble sampled from Psynthetic(X) contains the same second-order statistics as the natural scene ensemble, but lacks the higher-order statistics present in the natural scene ensemble.

Because the pixel statistics of the images are translation invariant horizontally, the covariance matrix of Pnatural(X) is a circulant matrix and can be diagonalized by a discrete Fourier transform. Therefore, we constructed Psynthetic(X) and generated the synthetic dataset in the frequency domain (Bialek, 2012). Operationally, we first performed a discrete Fourier transform (with the fft function in Matlab v2018a, RRID:SCR_001622) on each one-dimensional image in the natural scene dataset and obtained its Fourier-domain representation y(i) = (yk1(i), yk2(i), …, ykn(i)), where (i) denotes the ith image and kn denotes the knth Fourier component. We then calculated the average power at each frequency kn, σkn² = (1/m) Σi ykn(i) ykn(i)*, where m is the number of images and * denotes complex conjugation. At each frequency kn, we built two Gaussian distributions, each 𝒩(0, σkn²/2). To generate one synthetic image, we sampled one real number from each of these two distributions as the real and imaginary parts of the Fourier component of the synthetic image at frequency kn. In the end, we performed an inverse Fourier transform on the sampled Fourier components to obtain a synthetic image in the spatial domain. In total, we generated 1000 higher-order-structure-eliminated synthetic images, and we refer to this synthetic image dataset as the dataset in which higher-order structure was eliminated.
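An equivalent way to realize this construction — Gaussian samples whose expected power spectrum matches the natural scene average — is to scale the Fourier transform of white Gaussian noise by the square root of the target power; the conjugate symmetry of the white noise spectrum then guarantees real-valued images. This is an illustrative sketch, not the paper's Matlab code, and all names are assumptions.

```python
import numpy as np

def synth_images(power, n_samples, seed=0):
    """Gaussian synthetic images whose expected power spectrum equals
    `power` (average |FFT|^2 per frequency, symmetric in k), obtained
    by spectrally reshaping white Gaussian noise."""
    n = len(power)
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((n_samples, n))
    W = np.fft.fft(white, axis=1)        # E|W_k|^2 = n for unit white noise
    Y = W * np.sqrt(power / n)           # rescale to the target spectrum
    return np.real(np.fft.ifft(Y, axis=1))
```

Averaged over many samples, the measured power spectrum of the output converges to the target, while all higher-order statistics are those of a Gaussian ensemble.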

Computing and manipulating statistics of individual one-dimensional images

Request a detailed protocol

The natural scene dataset comprised an ensemble of heterogeneous images, and the statistics of different images can differ drastically. Thus, we considered each one-dimensional image to have its own contrast distribution, Ppixel(i)(X), where i indexes the image, and we treated the contrast of each pixel in an image as an independent sample of the random variable X. For each natural image, we computed the sample mean, sample variance, sample skewness, and sample kurtosis of its pixels, and show the histograms of these statistics (Figure 5—figure supplement 1 and Figure 5B).

We generated 14=2+10+2 synthetic image datasets to mimic various statistical properties of natural scenes (Table 2). These 14 datasets differ in three parameters: (1) the contrast range; (2) whether the contrast skewness was constrained; and (3) the specific value of the imposed skewness when the skewness was constrained. We conceptually justify this image synthesis method in Appendix 2, and here we focus on methodological details. For every image in the natural scene dataset, we generated a corresponding image in each synthetic image dataset. This involved two steps.

Table 2
Parameters for synthetic image datasets.
Index of dataset | Imposed mean | Imposed variance | Imposed skewness | Contrast range | Discrete levels
MED-1 | cμ | cσ² | NA | [cμ−δc, cμ+δc] | 32
MED-2 | cμ | cσ² | cγ | [cμ−δc, cμ+δc] | 32
MED-3 | 0 | cσ² | 1.25 | [−δc, δc] | 32
MED-4 | 0 | cσ² | 1 | [−δc, δc] | 32
MED-5 | 0 | cσ² | 0.75 | [−δc, δc] | 32
MED-6 | 0 | cσ² | 0.5 | [−δc, δc] | 32
MED-7 | 0 | cσ² | 0.25 | [−δc, δc] | 32
MED-8 | 0 | cσ² | −0.25 | [−δc, δc] | 32
MED-9 | 0 | cσ² | −0.5 | [−δc, δc] | 32
MED-10 | 0 | cσ² | −0.75 | [−δc, δc] | 32
MED-11 | 0 | cσ² | −1 | [−δc, δc] | 32
MED-12 | 0 | cσ² | −1.25 | [−δc, δc] | 32
MED-13 | 0 | cσ² | NA | [−2.5, 2.5] | 512
MED-14 | 0 | cσ² | cγ | [−2.5, 2.5] | 512

In step one, we determined all relevant image statistics and generated the corresponding maximum entropy distribution (MED). Operationally, for each individual image, we found its contrast range, [cmin,cmax], the largest contrast magnitude, δc=max(|cmin|,|cmax|), and the sample mean, cμ. We then derived the contrast range specified in Table 2, binned the range into N discrete levels, calculated the contrast frequency at each level to estimate Ppixel(i)(X), and estimated the contrast variance cσ2 and skewness cγ from this estimated distribution. Finally, we solved the MED (Appendix 2) with the constrained statistics specified in Table 2. In the 'imposed skewness' column, 'NA' means that the skewness was not constrained in the MED.
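Solving the MED amounts to finding exponential-family weights on x, x², and x³ whose moments match the targets; because the dual problem is convex, plain gradient descent suffices for a small number of discrete levels. The sketch below is illustrative (the function name, learning rate, and iteration count are assumptions), and it assumes the requested moments are feasible on the given levels.

```python
import numpy as np

def solve_med(levels, mean, var, skew, lr=0.005, n_iter=100_000):
    """Discrete maximum-entropy distribution p(x) ~ exp(l1*x + l2*x^2 + l3*x^3)
    with constrained mean, variance, and skewness, found by gradient
    descent on the convex dual problem."""
    x = np.asarray(levels, float)
    # convert (mean, var, skew) into raw moments E[x], E[x^2], E[x^3]
    targets = np.array([mean,
                        var + mean**2,
                        skew * var**1.5 + 3*mean*var + mean**3])
    feats = np.stack([x, x**2, x**3])          # shape (3, n_levels)
    lam = np.zeros(3)
    for _ in range(n_iter):
        logp = lam @ feats
        p = np.exp(logp - logp.max())          # stable softmax
        p /= p.sum()
        lam -= lr * (feats @ p - targets)      # dual gradient: E_p[f] - targets
    return p
```

At convergence the gradient vanishes, so the distribution's first three moments match the imposed statistics while maximizing entropy over the discrete levels.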

In step two, we generated the synthetic image. The solved MED captures the pixel statistics but does not capture any spatial correlations between pixels. Therefore, we decided to interpolate between sampled pixels to coarsely mimic the spatial correlations. Operationally, for an individual image, we calculated its spatial correlation function and found the cutoff distance, Δxα, where the correlation falls below α=0.2. We then sampled contrast values independently from the solved MED and placed these contrasts Δxα pixels apart. Finally, we up-sampled the low-resolution image to a high-resolution image using the resample function in Matlab v2018a (RRID:SCR_001622).

Note that MED-3 through MED-7 are theoretically related to MED-8 through MED-12 by contrast inversion. We thus generated synthetic datasets MED-8 to MED-12 by simply inverting the contrast of images in dataset MED-3 to MED-7.

Simulating naturalistic motion with natural scenes

Request a detailed protocol

To simulate the full-field motion signals induced by self-rotation in the natural environment, we rigidly translated images at various horizontal velocities (Fitzgerald and Clark, 2015). For each trial, we randomly chose one 1-dimensional image from a dataset. We had 35 image datasets in total, including 20 natural scene datasets preprocessed with different parameters and 15 synthetic image datasets; we therefore built 35 naturalistic motion datasets, where each motion dataset sampled all of its images from one particular image dataset. We sampled the velocity in two ways. First, we sampled it from a Gaussian distribution with zero mean and a standard deviation of 114 °/s, the measured standard deviation of the spontaneous rotational turning speed of freely walking flies. In Figure 3—figure supplement 1, we varied the standard deviation of the Gaussian distribution from 32 °/s to 512 °/s. Second, we selected image velocities at discrete values ranging from 0 °/s to 1000 °/s in 10 °/s intervals. Given an image-velocity pair, we rigidly moved the image at that velocity for one second. The temporal resolution was 60 Hz.

To eliminate any potential left-right asymmetry in naturalistic motion datasets, if we moved an image rightward at a certain speed, we always simulated a paired trial in which the same image was flipped horizontally and moved leftward with the same speed.

Calculating the responses of different motion detectors to naturalistic motion signals

Request a detailed protocol

Our study concerns two motion detectors: the HRC and the measured kernels. For the HRC, we created an array of 54 overlapping HRC elementary motion detectors that extended 270° horizontally. We calculated the instantaneous HRC outputs at the end of each trial and averaged them across space to get the model’s output rHRC. For the measured kernels, we calculated the instantaneous output of the kernels at the end of each trial, including the second-order response, r(2), the third-order response, r(3), and the total response, r = r(2) + r(3).

In Table 3, we listed the simulation parameters for each figure, including the motion detector, the image dataset, the velocity distribution, and the number of trials (image-velocity pairs). If the dataset is natural scenes, we also listed its preprocessing parameter FWHMlocal-mean. If the velocity distribution is Gaussian, we listed its standard deviation.

Table 3
Natural scene datasets for naturalistic motion simulations.
| Figure | Image dataset (FWHMlocal-mean / τlocal-mean) | Velocity distribution (σvel, or discrete values, °/s) | Number of trials | Motion detector |
| --- | --- | --- | --- | --- |
| Figure 1C green | Natural scene (25°) | Discrete [0:10:1000] | 1000 each velocity | HRC |
| Figure 1C purple | Synthetic, higher-order structure eliminated | Discrete [0:10:1000] | 1000 each velocity | HRC |
| Figure 3 | Natural scene (25°) | Gaussian (114) | 10,000 | Fly |
| Figure 3—figure supplement 1 | Natural scene (25°) | Gaussian (32, 64, 128, 256, 512) | 8000 each velocity distribution | Fly |
| Figure 3—figure supplement 2DE | Natural scene (10°–75°) | Gaussian (114) | 8000 each FWHMlocal-mean | Fly |
| Figure 3—figure supplement 2FG | Natural scene (10–500 ms) | Gaussian (114) | 8000 each τlocal-mean | Fly |
| Figure 5CDE green, Figure 5—figure supplement 3A | Synthetic-MED-1 | Gaussian (114) | 8000 | Fly |
| Figure 5CDE brown, Figure 5—figure supplement 3B | Synthetic-MED-2 | Gaussian (114) | 8000 | Fly |
| Figure 5FGH, Figure 5—figure supplement 3CD | Synthetic-MED 5–14 | Gaussian (114) | 8000 each dataset | Fly |
| Figure 5—figure supplement 3G blue | Synthetic-MED-13 | Gaussian (114) | 8000 | Fly |
| Figure 5—figure supplement 3G red | Synthetic-MED-14 | Gaussian (114) | 8000 | Fly |

For example, in Figure 3, we predicted the motion estimates of the measured kernels in response to naturalistic motion. The naturalistic motion was created from images sampled from the natural scene dataset with preprocessing parameter FWHMlocal-mean = 25°, and from velocities sampled from a Gaussian distribution with standard deviation 114 °/s. We simulated 10,000 independent trials, yielding 20,000 trials after enforcing left-right symmetry.

Evaluating the performance of a motion detector

Request a detailed protocol

We denote the true velocity of the image as vimg and the response of a motion detector as vest. We assessed the accuracy of the motion estimation using two metrics. First, we measured the Pearson correlation ρ between vimg and vest,

ρ = cov(vest, vimg) / √(var(vest) var(vimg)),

where the variances and covariances are evaluated across any of the naturalistic motion datasets introduced above. To estimate the uncertainty in  ρ induced by finite sample sizes, we randomly separated all independently simulated trials into 10 groups, calculated the Pearson correlation for each group, and estimated the SEM of the Pearson correlation across the groups. Second, when the image velocity was sampled at discrete values, we measured the variance of vest conditional on each possible image velocity, var(vest|vimg=v0).
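The correlation metric and its grouped SEM can be computed as in the following sketch (function names and the random-split seed are ours):

```python
import numpy as np

def pearson_with_sem(v_est, v_img, n_groups=10, seed=0):
    """Pearson correlation between estimated and true velocities, with an
    SEM from randomly splitting the trials into `n_groups` groups and
    taking the standard error of the per-group correlations."""
    rho = np.corrcoef(v_est, v_img)[0, 1]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(v_img))
    rhos = [np.corrcoef(v_est[g], v_img[g])[0, 1]
            for g in np.array_split(idx, n_groups)]
    sem = np.std(rhos, ddof=1) / np.sqrt(n_groups)
    return rho, sem
```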

Evaluating the improvement added by the third-order response

Request a detailed protocol

To evaluate how much improvement the third-order response added to the second-order response, we calculated the relative Pearson correlations: improvement = (ρ(2+3) − ρ(2)) / ρ(2). As in Evaluating the performance of a motion detector, to estimate the uncertainty induced by finite sample sizes, we separated all trials into 10 groups, calculated the Pearson correlation and improvement for each group, and estimated the SEM of the improvement across the groups.

Assessing the empirical weighting of the second-order and third-order responses

Request a detailed protocol

We modeled the image velocity as a linear combination of the second-order and third-order responses

vimg=α(2)r(2)+α(3)r(3)+ϵimg,

and estimated the optimal weighting coefficients, (α^(2), α^(3)), using ordinary least squares regression. We calculated ρbest = cov(rbest, vimg) / √(var(rbest) var(vimg)), where rbest = α^(2)r(2) + α^(3)r(3). We computed the relative weighting coefficient as w = α^(2)/α^(3).
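A sketch of this regression, using numpy's least-squares solver in place of whatever solver was used originally (names are ours):

```python
import numpy as np

def optimal_weights(r2, r3, v_img):
    """Least-squares weights (alpha2, alpha3) for predicting the image
    velocity from the second- and third-order responses (no intercept,
    matching the regression model above)."""
    X = np.column_stack([r2, r3])
    alpha, *_ = np.linalg.lstsq(X, v_img, rcond=None)
    r_best = X @ alpha
    rho_best = np.corrcoef(r_best, v_img)[0, 1]
    return alpha, rho_best
```

The relative weighting coefficient is then `alpha[0] / alpha[1]`.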

Calculating the residual of second-order response

Request a detailed protocol

The second-order response can be viewed as a function of the image velocity, as well as noise that depends on image structure:

r(2)=β(2) vimg+ϵ(2).

We estimated the noise term as

ϵ^(2) = r(2) − β^(2)vimg,

where β^(2) minimized the squared residual and ϵ^(2) denotes the estimated residual (noise).
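Since the regression has a single coefficient and no intercept, β^(2) has a closed form, as in this sketch (names are ours):

```python
import numpy as np

def second_order_residual(r2, v_img):
    """Regress the second-order response on image velocity through the
    origin and return the residual, an estimate of the structure-induced
    noise in r2."""
    beta = np.dot(r2, v_img) / np.dot(v_img, v_img)
    return r2 - beta * v_img
```

By construction, the residual is orthogonal to the velocity vector.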

Appendix 1

Modeling the fly's optomotor turning response with Volterra kernels

The visual motion system of the fly can generally be considered as a nonlinear mapping from spatiotemporal visual stimuli, denoted s(x,t), to behavioral turning responses, denoted r(t). In general, any time-invariant nonlinear system can be written as a Volterra series,

r(t) = H(0) + H(1)s(x,t) + H(2)s(x,t) + H(3)s(x,t) + ⋯,

where each H(n) is a convolutional operator,

H(n)s(x,t) = ∫dη1dτ1⋯∫dηndτn K(n)(η1,…,ηn,τ1,…,τn) s(η1,t−τ1)⋯s(ηn,t−τn),

ηi are spatial coordinates, τi are temporal coordinates, and K(n) is termed the nth-order Volterra kernel (Marmarelis and McCann, 1973; Schetzen, 1980). The system is fully specified by these kernels. In this section, we describe how we use this framework to model the fly’s visual system. We make several simplifying assumptions about the fly’s visual system.

Assumption 1. The system is third-order. This implies K(n)(η1,…,ηn,τ1,…,τn) = 0 for n > 3. This choice aims to account for the system’s potential ability to asymmetrically process light-dark signals while minimally extending the canonical second-order algorithm.

Assumption 2. The system is spatially-invariant. This implies that

K(n)(η1,…,ηn,τ1,…,τn) = K(n)(η1+δη,…,ηn+δη,τ1,…,τn).

Assumption 3. The system consists of a spatial array of elementary motion detectors (EMDs), and each EMD responds to inputs that are close in space. Since the spatial distance between neighboring ommatidia in the fruit fly is around 5°, we discretize space into pixels at Δx = 5° resolution and use integer indices to denote relative spatial location. We further assume that the fly EMD does not respond to interactions between points spaced more than Δx apart,

K(n)(η1,…,ηn,τ1,…,τn) = 0 if |ηi−ηj| > Δx for any i, j.

This modeling simplification is frequently made in fly neuroscience (Buchner, 1976), but modern receptive field measurements of individual motion detectors in Drosophila suggest that they integrate over a wider field of view (Leong et al., 2016; Salazar-Gatzimas et al., 2016).

Assumption 4. The system is mirror anti-symmetric. This means that if r(t) is the system’s response to the stimulus s(x,t), then the system will respond to s(−x,t) with response −r(t). This is an important assumption for motion processing systems, and we will discuss its implications for the Volterra kernels later in this section.

Given the first three assumptions, we model the fly’s turning response r(t) as

r(t) = ∑ξ=1…NEMD (rξ(0)(t) + rξ(1)(t) + rξ(2)(t) + rξ(3)(t)),

where

rξ(0)(t)=r0,
rξ(1)(t) = ∫dτ1 K(1)(τ1) s(xξ,t−τ1) + ∫dτ1 K(1)(τ1) s(xξ+Δx,t−τ1),
rξ(2)(t) = ∑η1∈{0,Δx} ∑η2∈{0,Δx} ∫dτ1dτ2 K(2)(η1,η2,τ1,τ2) s(xξ+η1,t−τ1) s(xξ+η2,t−τ2),
rξ(3)(t) = ∑η1∈{0,Δx} ∑η2∈{0,Δx} ∑η3∈{0,Δx} ∫dτ1dτ2dτ3 K(3)(η1,η2,η3,τ1,τ2,τ3) s(xξ+η1,t−τ1) s(xξ+η2,t−τ2) s(xξ+η3,t−τ3).

The response r(t) is the sum of the responses contributed by the ith-order kernels at different spatial locations ξ, denoted rξ(i)(t). K(1)(τ1), K(2)(η1,η2,τ1,τ2), and K(3)(η1,η2,η3,τ1,τ2,τ3) approximate how each EMD maps the inputs s(xξ,t) and s(xξ+Δx,t) into its output. NEMD = 54 is the number of putative EMD units.

Next, we will simplify the notations for K(2)(η1,η2,τ1,τ2) and K(3)(η1,η2,η3,τ1,τ2,τ3) , and discuss symmetries in these kernels.

The second-order kernel K(2)(η1,η2,τ1,τ2) is a four-dimensional tensor, with two dimensions in time and two in space. Since |η1−η2| ≤ Δx, the spatial arguments are limited to two discrete points, and the possible combinations of (η1,η2) are (L,L), (L,R), (R,L), and (R,R), where L (R) denotes left (right). We replace the spatial arguments with subscripts, so that the 4-dimensional K(2)(η1,η2,τ1,τ2) can be rewritten as four 2-dimensional kernels,

K(2)(η1,η2,τ1,τ2)={KLL(2)(τ1,τ2)KLR(2)(τ1,τ2)KRL(2)(τ1,τ2)KRR(2)(τ1,τ2).

In Kη1,η2(2)(τ1,τ2), the first (second) temporal argument is related to the spatial location represented by the first (second) subscript. For example, KLR(2)(τ1,τ2) means K(2)(η1=L,η2=R,τ1,τ2).

There are three types of symmetries in these Kη1,η2(2)(τ1,τ2). First, since the system is translationally-invariant, KLL(2)(τ1,τ2)=KRR(2)(τ1,τ2). Second, KLR(2)(τ1,τ2) describes how the system responds to sL(tτ1)sR(tτ2), and KRL(2)(τ2,τ1) describes how the system responds to sR(tτ2)sL(tτ1).

Because the multiplication operator is commutative, that is, sL(t−τ1)sR(t−τ2) = sR(t−τ2)sL(t−τ1), one must have KLR(2)(τ1,τ2) = KRL(2)(τ2,τ1).

Therefore, the second-order response of an EMD can be simplified as,

rξ(2)(t)=dτ1dτ2[2KLR(2)(τ1,τ2)sξ(tτ1)sξ+1(tτ2)+KLL(2)(τ1,τ2)(sξ(tτ1)sξ(tτ2)+sξ+1(tτ1)sξ+1(tτ2))],

where we replace the spatial arguments in stimulus with discrete subscripts ξ as well.

Third, to derive the consequences of mirror anti-symmetry, note that the second-order kernel is locally sensitive to two discrete points in space, s(x,t) = [sL(t), sR(t)]. Mirror anti-symmetry states that if the system’s response to sL(t) = a(t) and sR(t) = b(t) is r(t), then the system’s response to the reflected stimulus, sL(t) = b(t) and sR(t) = a(t), is −r(t), where a(t) and b(t) are arbitrary functions of time. The system’s response when sL(t) = a(t) and sR(t) = b(t) is

rξ,ab(2)(t)=dτ1dτ2 [KLL(2)(τ1,τ2)(a(tτ1)a(tτ2)+b(tτ1)b(tτ2))+2KLR(2)(τ1,τ2)a(tτ1)b(tτ2)],

and the response when sL(t)=b(t) and sR(t)=a(t) is

rξ,ba(2)(t)=dτ1dτ2 [KLL(2)(τ1,τ2)(b(tτ1)b(tτ2)+a(tτ1)a(tτ2))+2KLR(2)(τ1,τ2)b(tτ1)a(tτ2)].

If rξ,ba(2)(t) = −rξ,ab(2)(t) for any a(t) and b(t), then one must have

KLL(2)(τ1,τ2)=KLL(2)(τ1,τ2)=0,

and

KLR(2)(τ1,τ2)=KLR(2)(τ2,τ1).

Thus, the general expression for the second-order response of the EMD further simplifies to

rξ(2)(t)=2dτ1dτ2KLR(2)(τ1,τ2)sξ(tτ1)sξ+1(tτ2).

We analyzed the third-order kernel in a similar manner. It is a 6-dimensional tensor, with three dimensions in time and three in space. Its spatial arguments are also limited to two points, and the possible combinations of (η1, η2, η3) are (L,L,L), (R,R,R), (L,L,R), (L,R,L), (R,L,L), (R,R,L), (R,L,R), and (L,R,R). We replace the spatial arguments with subscripts and represent the 6-dimensional third-order kernel with eight 3-dimensional kernels,

K(3)(η1,η2,η3,τ1,τ2,τ3)={KLLL(3)(τ1,τ2,τ3)KRRR(3)(τ1,τ2,τ3)KLLR(3)(τ1,τ2,τ3)KLRL(3)(τ1,τ2,τ3)KRLL(3)(τ1,τ2,τ3)KRRL(3)(τ1,τ2,τ3)KRLR(3)(τ1,τ2,τ3)KLRR(3)(τ1,τ2,τ3).

As was the case of the second-order kernel, in Kη1,η2,η3(3)(τ1,τ2,τ3) the first (second, third) temporal argument is related to the spatial location denoted with the first (second, third) subscript. For example, KLLR(3)(τ1,τ2,τ3)=K(3)(η1=L,η2=L,η3=R,τ1,τ2,τ3).

As before, these eight blocks are redundant. From spatial-invariance and the commutative property of multiplication, one finds

KLLL(3)(τ1,τ2,τ3)=KRRR(3)(τ1,τ2,τ3),
KLLR(3)(τ1,τ2,τ3)=KLRL(3)(τ1,τ3,τ2)=KRLL(3)(τ3,τ1,τ2),
KRRL(3)(τ1,τ2,τ3)=KRLR(3)(τ1,τ3,τ2)=KLRR(3)(τ3,τ1,τ2).

For each 3D tensor, at least two out of the three spatial arguments are the same, so there is symmetry within the tensor. For example,

KLLR(3)(τ1,τ2,τ3)=KLLR(3)(τ2,τ1,τ3),

and

KLRR(3)(τ1,τ2,τ3)=KLRR(3)(τ1,τ3,τ2).

Following this logic, the third-order response of an EMD can be written as

rξ(3)(t)=dτ1dτ2dτ3[3KLLR(3)(τ1,τ2,τ3)sξ(tτ1)sξ(tτ2)sξ+1(tτ3)+3KRRL(3)(τ1,τ2,τ3)sξ+1(tτ1)sξ+1(tτ2)sξ(tτ3)+KLLL(3)(τ1,τ2,τ3)(sξ(tτ1)sξ(tτ2)sξ(tτ3)+sξ+1(tτ1)sξ+1(tτ2)sξ+1(tτ3))].

One can again apply the definition of mirror anti-symmetry to find

KLLL(3)(τ1,τ2,τ3)=0,
KLLR(3)(τ1,τ2,τ3)=KRRL(3)(τ1,τ2,τ3).

Thus, with mirror anti-symmetry, the third-order response of an EMD can be simplified as,

rξ(3)(t) = 3∫dτ1dτ2dτ3 KLLR(3)(τ1,τ2,τ3)[sξ(t−τ1)sξ(t−τ2)sξ+1(t−τ3) − sξ+1(t−τ1)sξ+1(t−τ2)sξ(t−τ3)].

We have shown how mirror anti-symmetry manifests in the second- and third-order kernels. In a similar way, for the zeroth- and first-order kernels, mirror anti-symmetry implies that

r0=0,
K(1)(τ1)=0.

Intuitively, the r0 term is the response of the EMD when there is no visual input, and a reasonable motion detector should not be biased to turn left or right in the absence of input. The first-order kernel describes how the response is influenced by the stimulus at a single spatial location, which cannot carry any motion information; thus K(1)(τ1) = 0.

In summary, in our model, the turning response is

r(t) = ∑ξ rξ(t) = ∑ξ (rξ(2)(t) + rξ(3)(t)),
rξ(2)(t)=2dτ1dτ2KLR(2)(τ1,τ2)sξ(tτ1)sξ+1(tτ2),
rξ(3)(t)=3dτ1dτ2dτ3KLLR(3)(τ1,τ2,τ3)[(sξ(tτ1)sξ(tτ2)sξ+1(tτ3)sξ+1(tτ1)sξ+1(tτ2)sξ(tτ3))],

where rξ(t) is the response of the EMD at spatial location ξ. In the experiments, we discretized time into bins of size Δt = 1/60 s, so the integrals become summations, and we write the local responses as

rξ(2)(t) = 2 ∑τ1,τ2 KLR(2)(τ1,τ2) sξ(t−τ1) sξ+1(t−τ2) (Δt)²,
rξ(3)(t) = 3 ∑τ1,τ2,τ3 KLLR(3)(τ1,τ2,τ3)[sξ(t−τ1)sξ(t−τ2)sξ+1(t−τ3) − sξ+1(t−τ1)sξ+1(t−τ2)sξ(t−τ3)] (Δt)³.
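For concreteness, the discretized responses of a single EMD can be evaluated with tensor contractions. The sketch below uses arbitrary random kernels (not the measured ones; names are ours) and checks that mirroring the stimulus flips the sign of both responses, as mirror anti-symmetry requires:

```python
import numpy as np

def emd_responses(s, K2, K3, dt=1/60):
    """Second- and third-order responses of one EMD at the current time.
    `s` is a (time, 2) array of left/right inputs; K2[t1, t2] weights
    s_L(t - t1) s_R(t - t2); K3 is the LLR kernel. Illustrative sketch
    with arbitrary kernels, not the measured ones."""
    T = K2.shape[0]
    sL = s[::-1, 0][:T]   # most recent T samples, newest first
    sR = s[::-1, 1][:T]
    r2 = 2 * np.einsum('ab,a,b->', K2, sL, sR) * dt**2
    r3 = 3 * (np.einsum('abc,a,b,c->', K3, sL, sL, sR)
              - np.einsum('abc,a,b,c->', K3, sR, sR, sL)) * dt**3
    return r2, r3

rng = np.random.default_rng(4)
T = 10
A = rng.standard_normal((T, T))
K2 = A - A.T                            # K2(t1,t2) = -K2(t2,t1), per mirror anti-symmetry
K3 = rng.standard_normal((T, T, T))
K3 = (K3 + K3.transpose(1, 0, 2)) / 2   # symmetric in its two L arguments
s = rng.standard_normal((30, 2))
r2, r3 = emd_responses(s, K2, K3)
r2m, r3m = emd_responses(s[:, ::-1], K2, K3)  # mirror the stimulus
```

Mirroring the stimulus exchanges the left and right inputs, and with these kernel symmetries both responses change sign exactly.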

Appendix 2

Manipulating image statistics with maximum entropy distributions

To investigate how the skewness of the image contrast distribution influences motion detection, we developed a method for generating synthetic images with specified skewness values. In Section 1, we formalize this image synthesis problem and provide a rationale for the maximum entropy method. In Section 2, we derive formulae for finding maximum entropy distributions (MEDs) with constrained lower-order moments. In Section 3, we discuss how the contrast range sets implicit constraints on the MED.

1. Motivation

We consider each 1-dimensional natural image to have its own contrast distribution, Ppixel(i)(X), such that the contrast of each pixel in the ith image is an independent sample from Ppixel(i)(X). We would like to generate a synthetic contrast distribution, Psyn(i)(X), such that Psyn(i)(X) shares certain low-order contrast statistics with Ppixel(i)(X). More specifically, we want Psyn(i)(X) and Ppixel(i)(X) to have matched means, variances, and/or skewness levels. However, it is important to recognize that Psyn(i)(X) is not uniquely determined by its lower-order moments, because many distributions share any given finite set of moments. Nevertheless, among all of these qualifying distributions, the distribution with maximal entropy is unique. The maximum entropy requirement thus implicitly specifies all unconstrained statistics. Maximum entropy distributions are also conceptually appealing because they have the least statistical structure consistent with the set of chosen constraints. Here we use maximum entropy distributions to generate synthetic images with specific statistics by sampling pixel contrasts from Psyn(i)(X). By manipulating the statistics that define Psyn(i)(X), we manipulate the contrast statistics of the synthetic image.

2. Solving maximum entropy distribution by minimizing the free energy function

To illustrate the structure of the maximum entropy distribution (MED) with constrained moments, we concretely consider the example when the MED is constrained to have a specific mean, variance, and skewness. We refer to this distribution as the mean-variance-and-skewness constrained MED. Since constraining the mean, variance, and skewness is equivalent to constraining the first, second and third moments, we equivalently refer to this distribution as the third-order MED.

The entropy of a random variable X is

H(X) = −E[logP(X)],

where P(X) is the probability distribution of X, and E[⋅] is the expectation operator over P(X). In particular, E[⋅] = ∫dx P(X=x)[⋅]. By definition, the third-order MED is the probability distribution, P(X), such that

P(X)=argmaxP(X) H(X),

subject to,

E[Xi]=μi, i=1,2,3.

where μ1, μ2, and μ3 are the first, second, and third moments. To optimize for P(X), one can introduce Lagrange multipliers, λ0, λ1, λ2 and λ3, to enforce the normalization of the probability distribution and the three moment constraints. The problem is thus transformed into extremizing

L(P(X),λ0,λ1,λ2,λ3) = −E[logP(X)] + λ0(∫dx P(X=x) − 1) + ∑i=1,2,3 λi(E[Xi] − μi).

Note that L(P(X),λ) is a functional of P(X), and a necessary condition for P(X) to extremize L is

0 = δL/δP(X=x) = −(logP(X=x) + 1) + λ0 + λ1x + λ2x² + λ3x³.

This implies that third-order MED has the form

P(X=x) = exp(−1 + λ0 + λ1x + λ2x² + λ3x³),

and the Lagrange multipliers must be solved to satisfy ∫dx P(X=x) = 1 and E[Xi] = μi, i=1,2,3.

An alternative to solving the nonlinear constraint-satisfaction equations for the Lagrange multipliers is to find λ=(λ1,λ2,λ3) such that

λ=argminλF(λ),

where

F(λ) = log(Z(λ)),  Z(λ) = ∫dx Q(x,λ),  Q(x,λ) = exp(λ1(x−μ1) + λ2(x²−μ2) + λ3(x³−μ3)).

Here Q(x,λ) can be thought of as an unnormalized probability distribution for X, Z(λ) is the normalizing factor, and F(λ) is the log of this normalizing factor. Readers familiar with statistical mechanics will recognize F(λ) as the free energy function and Z(λ) as the partition function. It turns out that the third-order MED is

P(X=x) = (1/Z(λ)) Q(x,λ).

This distribution manifestly has the right functional form, so to prove that it is the third-order MED we merely need to show that all constraints are satisfied by the distribution. Indeed, the constraint that ∫dx P(X=x) = 1 is trivially satisfied because the partition function Z(λ) normalizes Q(x,λ). Since λ minimizes the free energy, we also know

∂F(λ)/∂λi = 0.

This derivative is easily evaluated as

∂F(λ)/∂λi = ∂log(Z(λ))/∂λi = (1/Z(λ)) ∂Z(λ)/∂λi = (1/Z(λ)) ∫dx ∂Q(x,λ)/∂λi = (1/Z(λ)) ∫dx Q(x,λ)(xi−μi) = Eλ[Xi] − μi,

where Eλ[] is the expectation operator over Pλ(x)=Q(x,λ)/Z(λ). Consequently,

∂F(λ)/∂λi = 0 ⟺ Eλ[Xi] = μi,

and minimizing the free energy provides parameters that guarantee that the constraints are satisfied.

This free energy formulation is computationally convenient because F(λ) is a convex function that can be easily minimized using powerful techniques from convex optimization. To see this, note that the matrix of second derivatives is,

∂²F(λ)/∂λi∂λj = ∂/∂λi [(1/Z(λ)) ∂Z(λ)/∂λj] = −(1/Z(λ))² (∂Z(λ)/∂λi)(∂Z(λ)/∂λj) + (1/Z(λ)) ∂²Z(λ)/∂λi∂λj.

We showed above that

(1/Z(λ)) ∂Z(λ)/∂λi = Eλ[xi−μi],  so that  (1/Z(λ))² (∂Z(λ)/∂λi)(∂Z(λ)/∂λj) = Eλ[xi−μi] Eλ[xj−μj].

Similarly, we next note that

(1/Z(λ)) ∂²Z(λ)/∂λi∂λj = (1/Z(λ)) ∂²/∂λi∂λj ∫dx Q(x,λ) = (1/Z(λ)) ∫dx ∂²Q(x,λ)/∂λi∂λj = (1/Z(λ)) ∫dx Q(x,λ)(xi−μi)(xj−μj) = Eλ[(xi−μi)(xj−μj)].

Therefore, these two terms together imply that

∂²F(λ)/∂λi∂λj = Eλ[(xi−μi)(xj−μj)] − Eλ[xi−μi] Eλ[xj−μj] = Covλ[xi−μi, xj−μj].

Since the second derivative of F(λ) is a covariance matrix for all values of λ, and all covariance matrices are positive semi-definite, this implies F(λ) is a convex function.

Similarly, one can find the maximum entropy distribution (MED) with constrained mean and variance by minimizing another free energy function:

λ=(λ1,λ2)=argminλF(λ1,λ2),

where

F(λ) = log(Z(λ)),  Z(λ) = ∫dx Q(x,λ),  Q(x,λ) = exp(λ1(x−μ1) + λ2(x²−μ2)),

and the maximum entropy distribution will be

P(X=x) = (1/Z(λ)) Q(x,λ).

We refer to this distribution as the mean-and-variance-constrained MED, or second-order MED.

In practice, we solved each MED by numerically minimizing its associated free energy function with the fminunc function in MATLAB v2018a.
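In Python, the same minimization can be sketched with scipy standing in for fminunc; the grid resolution, optimizer choice, and settings below are our own, and the kernels of the example are not the paper's data:

```python
import numpy as np
from scipy.optimize import minimize

def solve_med(mu, bound, n_grid=2001):
    """Solve the moment-constrained MED on [-bound, bound] by minimizing
    the free energy F(lambda) = log Z(lambda) on a uniform grid. `mu`
    lists the constrained moments (mu_1, ..., mu_k). Returns the grid,
    the normalized density, and the fitted lambda."""
    x = np.linspace(-bound, bound, n_grid)
    dx = x[1] - x[0]
    g = np.stack([x ** (i + 1) - m for i, m in enumerate(mu)])  # x^i - mu_i

    def free_energy(lam):
        e = lam @ g
        c = e.max()                      # shift for numerical stability
        return c + np.log(np.exp(e - c).sum() * dx)

    res = minimize(free_energy, np.zeros(len(mu)), method='Nelder-Mead',
                   options={'xatol': 1e-10, 'fatol': 1e-13, 'maxiter': 20000})
    p = np.exp(res.x @ g)
    p /= p.sum() * dx
    return x, p, res.x

# A mean-and-variance-constrained (second-order) MED with wide bounds is
# close to a Gaussian, with lambda_2 near -1/(2 sigma^2).
x, p, lam = solve_med([0.0, 0.25], 5.0)
```

On wide bounds the recovered λ2 is close to −1/(2σ²) = −2, matching the Gaussian identification in Section 3 below.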

3. Contrast bounds set implicit constraints on the MED

In the previous section, we emphasized that the free energy, F, depends on the Lagrange multipliers, λ, because optimizing over these parameters allowed us to find the corresponding MED. However, the free energy also depends on the constrained moments, μ, which entered the formulae as fixed parameters. Moreover, the free energy depends on the state space of the random variable, which is the set of values that X can take. For continuous variables, this manifests as the integration bounds, [a,b], in the partition function, Z(λ) = ∫ab dx Q(x,λ). Therefore, we generally expect

λ=λ(μ,a,b)=argminλF(λ|μ,a,b).

Since X represents the pixel contrast, we refer to [a,b] as the contrast range and a or b separately as contrast bounds. In this section, we discuss in detail how λ depends on [a,b]. Since [a,b] influences the MED, we say [a,b] sets an implicit constraint on the MED.

A natural choice of the contrast range is the entire real line. For example, when one constrains the mean and variance of a distribution and defines X on the entire real line, then the MED is the Gaussian distribution. Moreover, by rewriting the probability density of the Gaussian

P(X=x) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²)),

in a slightly different form

P(X=x) = (e^(−1/2)/√(2πσ²)) exp((μ/σ²)(x−μ) − (1/(2σ²))(x²−σ²−μ²)),

we may identify the Lagrange multipliers that minimize the free energy as λ1 = μ/σ² and λ2 = −1/(2σ²).

In the Gaussian distribution, the probability density drops to zero very quickly as x goes to positive or negative infinity, because λ2x² is a large negative number in either case. However, when one wants to find an MED whose highest-order constraint is an odd-order moment, a finite contrast bound becomes necessary. This is because the sign of λ2n+1x^(2n+1) depends on the sign of x for any natural number n, thereby causing the probability density to approach zero on one half of the real line and to diverge on the other half. To illustrate this point concretely, consider the first-order MED, which has a constrained mean. The first-order MED must take the form

P(X=x)=1Zexp(λ1(xμ1)).

In this distribution, exp(λ1(x−μ1)) blows up as x goes to +∞ or to −∞ (unless λ1 = 0). In either case, no normalizing factor Z could satisfy ∫−∞+∞ (1/Z) exp(λ1(x−μ1)) dx = 1. Therefore, the first-order MED is generally ill-defined on the real line.

We thus sought to compute the dependence of the first-order MED on its contrast range. The parameters in the MED minimize the free energy

λ = argminλ F(λ|a,b,μ1) = argminλ log ∫ab dx exp(λ(x−μ1)).

Without loss of generality, we can shift the x-axis so that the contrast range is symmetric around 0. Setting x′ = x − (b+a)/2, we find

λ = argminλ F(λ|b′,μ′) = argminλ log ∫−b′b′ dx′ exp(λ(x′−μ′)),

where b′ = (b−a)/2 and μ′ = μ1 − (b+a)/2. To lighten the notation, we write b for b′ and μ for μ′ in what follows.

This optimization problem for the first-order MED can be solved analytically. First, note that the partition function is

Z(λ) = ∫−bb dx exp(λ(x−μ)) = { 2b, λ=0;  (1/λ)(e^(λ(b−μ)) − e^(−λ(b+μ))), λ≠0.

This partition function is continuous at λ=0. To see this, note that

lim(λ→0) Z(λ) = lim(λ→0) [e^(λ(b−μ)) − e^(−λ(b+μ))]/λ = lim(λ→0) [(b−μ)e^(λ(b−μ)) + (b+μ)e^(−λ(b+μ))]/1 = 2b = Z(0),

where we used L’Hôpital’s rule to evaluate the limit. Therefore the free energy,

F(λ|b,μ) = { log(2b), λ=0;  log(e^(λ(b−μ)) − e^(−λ(b+μ))) − log(λ), λ≠0,

is also continuous at λ=0. Furthermore, the partition function is differentiable at λ=0. In particular,

dZ/dλ|λ=0 = lim(h→0) [Z(h) − Z(0)]/h = lim(h→0) [(1/h)(e^(h(b−μ)) − e^(−h(b+μ))) − 2b]/h = lim(h→0) [e^(h(b−μ)) − e^(−h(b+μ)) − 2bh]/h² = lim(h→0) [(b−μ)e^(h(b−μ)) + (b+μ)e^(−h(b+μ)) − 2b]/(2h) = lim(h→0) [(b−μ)²e^(h(b−μ)) − (b+μ)²e^(−h(b+μ))]/2 = −2bμ,

where we again used L’Hôpital’s rule. Consequently, the derivative of the free energy at λ=0 is

dF/dλ|λ=0 = (1/Z) dZ/dλ|λ=0 = −2bμ/(2b) = −μ.

The derivative for λ0 is straightforward to evaluate, and we find

dF/dλ|λ≠0 = [(b−μ)e^(λ(b−μ)) + (b+μ)e^(−λ(b+μ))]/[e^(λ(b−μ)) − e^(−λ(b+μ))] − 1/λ = −μ + b(e^(λb) + e^(−λb))/(e^(λb) − e^(−λb)) − 1/λ.

Therefore, the final formula is

dF(λ|b,μ)/dλ = { −μ, λ=0;  −μ + b(e^(λb) + e^(−λb))/(e^(λb) − e^(−λb)) − 1/λ, λ≠0.

We determine λ by setting this expression equal to zero.

Most simply, if μ=0, then λ=0 minimizes the free energy, and the MED is a uniform distribution. This result is intuitive. Without an explicit mean constraint, the zeroth-order MED with a finite contrast bound is a uniform distribution with zero mean, and the requirement that the distribution have zero mean is already satisfied.

If μ0, we need to solve

0 = −μ + b(e^(λb) + e^(−λb))/(e^(λb) − e^(−λb)) − 1/λ.

We rearrange this expression to find

μ/b = (e^(λb) + e^(−λb))/(e^(λb) − e^(−λb)) − 1/(λb) = coth(λb) − 1/(λb),

where coth(x) is the hyperbolic cotangent function. If we define f(x) = coth(x) − 1/x, then we can rewrite the above equation as

μ/b = f(λb),

and we get

λ = (1/b) f⁻¹(μ/b).

From the above equation, we observe that λ depends only on μ/b and 1/b. From Appendix 2—figure 1A, we see that λ monotonically increases with μ/b. Intuitively, μ/b sets the relative scale of μ within the contrast range, and thus the degree of asymmetry in the distribution. For example, when μ is zero, λ is zero and the MED is perfectly symmetric. As μ deviates from 0, λ deviates from zero and the distribution distorts asymmetrically from the uniform distribution to satisfy the non-zero mean. As the ratio of μ to b grows, λ must deviate further from 0, and the distribution becomes increasingly asymmetric. When μ/b is small, λ is roughly linear in μ/b. Also note that λ is simply proportional to 1/b; this dependence follows from the requirement that λx be dimensionless.

Appendix 2—figure 1
The shape of MED depends on the contrast range.

(A) The shape of the function f⁻¹(x), where f(x) = coth(x) − 1/x. (B) The mean-and-variance-constrained MED depends on the contrast range. The mean is constrained to be 0. The upper bound of the contrast range is fixed at 1.5, and the ratio of the lower bound to the upper bound varies from −1 to −0.2. As the contrast range becomes more asymmetric, the MED becomes more asymmetric. (C) When the contrast range is asymmetric, the mean-and-variance-constrained MED has induced non-zero skewness.
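Since f is odd and monotonically increasing, the inversion λ = (1/b) f⁻¹(μ/b) can be carried out with a scalar root-finder, as in this sketch (the bracket and sign handling are our choices):

```python
import numpy as np
from scipy.optimize import brentq

def f(x):
    """f(x) = coth(x) - 1/x, the mean of the first-order MED on [-1, 1]."""
    return np.cosh(x) / np.sinh(x) - 1.0 / x

def lambda_first_order(mu, b):
    """Solve mu/b = f(lambda b) for lambda, assuming 0 < |mu| < b.
    f is odd and monotone, so we solve on a positive bracket and restore
    the sign afterwards."""
    y = brentq(lambda t: f(t) - abs(mu) / b, 1e-8, 500.0)
    return np.sign(mu) * y / b

lam = lambda_first_order(0.3, 1.0)   # MED on [-1, 1] with mean 0.3
```

The result can be checked by computing the mean of exp(λx) on the contrast range directly.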

The mean-and-variance-constrained MED also depends on the finite contrast range. To simplify the discussion, we again set the zero point of x-axis such that the contrast range is symmetric around 0. Thus, our task is to find λ1,λ2 that minimize the free energy function,

λ = (λ1,λ2) = argminλ F(λ1,λ2|μ1,μ2,b) = argminλ log ∫−bb dx exp(λ1(x−μ1) + λ2(x²−μ2)),

where μ1 and μ2 are the shifted first-order and second-order moments imposed on the MED, and [b,b] is the shifted contrast range. As was the case for the first-order MED, when μ1=0, the distribution is symmetric and λ1=0. Furthermore, as the contrast bound becomes large, the distribution approaches the Gaussian distribution. However, when μ10, the distribution has to distort asymmetrically to achieve the non-zero mean, with large distortions occurring when the mean is within a few standard deviations of the boundary (Appendix 2—figure 1B). As a result, even though we did not require the distribution to have a specific skewness, the second-order MED can have non-zero skewness (Appendix 2—figure 1C).

These mathematical considerations lead to the design of our numerical methods. Operationally, for each individual image, we found its contrast range, [cmin,cmax], and the largest contrast magnitude, δc = max(|cmin|,|cmax|). We then constructed a symmetric contrast range around the constrained mean μ1 and computed the MED on [μ1−δc, μ1+δc]. This avoided inducing non-zero skewness in the mean-and-variance-constrained MED. For consistency, we used the same symmetric contrast range for the mean-variance-and-skewness-constrained MED. Since the contrast bound implicitly influences the shape of the MED and different individual images have different contrast ranges, this method takes guidance from natural scenes and uses heterogeneous contrast ranges that match the contrast range of each original natural scene. However, this heterogeneity introduces variability across images that could potentially impact the estimation accuracy we observed. Therefore, we also simulated secondary datasets in which we used a common symmetric bound for all synthetic images. We chose this bound such that we could successfully solve the MED for the mean, variance, and skewness levels of most natural images.
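The per-image symmetric contrast range described above can be constructed as in this short sketch (names are ours):

```python
import numpy as np

def symmetric_contrast_range(image, mu1):
    """Build the symmetric contrast range [mu1 - dc, mu1 + dc], where dc is
    the image's largest contrast magnitude."""
    dc = max(abs(float(image.min())), abs(float(image.max())))
    return mu1 - dc, mu1 + dc

lo, hi = symmetric_contrast_range(np.array([-0.4, 0.1, 0.9]), 0.1)
```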

References

- Berger AL, Pietra VJD, Pietra SAD (1996). A maximum entropy approach to natural language processing. Computational Linguistics 22:39–71.
- Bialek W (2012). Biophysics: Searching for Principles. Princeton University Press.
- Dayan P, Abbott LF (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press.
- Julesz B (1962). Visual pattern discrimination. IEEE Transactions on Information Theory 8:84–92. https://doi.org/10.1109/TIT.1962.1057698
- Marr D, Poggio T (1976). From understanding computation to understanding neural circuitry. Neurosciences Research Program Bulletin, 15.
- McIntosh LT, Maheswaranathan N, Nayebi A, Ganguli S, Baccus SA (2016). Deep learning models of the retinal response to natural scenes. Advances in Neural Information Processing Systems 29:1369–1377.
- Meyer HG, Schwegmann A, Lindemann JP, Egelhaaf M (2014). Panoramic High Dynamic Range Images in Diverse Environments. Bielefeld University.
- Movshon JA, Adelson EH, Gizzi M, Newsome WT (1985). The analysis of moving visual patterns. In: Chagas C, Gattass R, Gross C, editors. Pattern Recognition Mechanisms. Springer-Verlag Berlin Heidelberg. pp. 117–151.
- Schetzen M (1980). The Volterra and Wiener Theories of Nonlinear Systems. Krieger Publishing Co., Inc.
- Wiener N (1966). Nonlinear Problems in Random Theory. Massachusetts: The MIT Press.

Decision letter

  1. Stephanie Palmer
    Reviewing Editor; University of Chicago, United States
  2. Ronald L Calabrese
    Senior Editor; Emory University, United States
  3. Fred Rieke
    Reviewer; University of Washington, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Thank you for submitting your article "Measured perceptual nonlinearities show how ON-OFF asymmetric processing improves motion estimation in natural scenes" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Ronald Calabrese as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Fred Rieke (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This paper takes a close look into the natural scene statistics that influence behavior. In particular, the work investigates the importance of second- and third-order stimulus correlations on motion sensing in flies. Second-order correlations have been emphasized in past work, and a few past studies have suggested that third-order correlations also contribute, but this work provides the first comprehensive treatment of the issue. The paper presents an analysis of fly behavioral responses to moving patterns based on rigid rotations of static natural scenes, either with fully natural statistics, or modified parametric stimuli that retain the skewness in the contrast distribution of natural scenes. Analysis of behavioral responses to these stimulus variants indicates that the fly uses third -order computations effectively to counter some of the systematic (and well-known) variations introduced by the second-order computations prescribed by canonical motion detection models. Both the experiments and the analysis are presented clearly, and the authors make a convincing case for their conclusion that the third-order term improves velocity estimation in natural scenes. The computational strategies discovered here likely extend to other visual systems. The stimulus construction algorithm developed here could also be used for other model systems and analyses.

Essential revisions:

1) As pointed out in the Discussion section, the Volterra kernel approximation represents a general and systematic approach to the identification of nonlinear systems (with practical caveats, as also mentioned). But there is also a more fundamental issue: Volterra kernels are essentially high dimensional polynomial approximations, and as such, although they are general, they are extremely inefficient in describing compressive nonlinearities (and biological systems are essentially always compressive in their typical range of operation). Motion estimation by correlation is a second-order Volterra operation, and therefore diverges quadratically. To counter that divergence, the system must somehow normalize the response to the contrast variance, such as in contrast normalization. And this is what seems to happen in biological motion detectors: It is well known that motion responses saturate as a function of contrast.

In this light, the finding that the third-order Volterra kernel improves velocity estimates can then be simply seen as a corollary of the fact that the third-order kernel is essentially a "fitting parameter" which is necessarily better than a fit to second-order. Alternatively, as noted in the paper, the third-order behavior could be due to more or less fixed asymmetries in processing, or even due to nonlinear behavior in early vision. Something along these lines was proposed by Bulthoff and Goetz (Bulthoff and Goetz, 1979) to explain a motion illusion in human and fly vision (and that should be cited in this paper).

These points should be addressed fully in the text of the paper, both in the Results and in the Discussion.

2) Response predictions and quantification. The kernel approach estimates behavioral responses, but the predictions are then compared directly to image velocity. Implicit in this (at least not clearly stated in the text of the manuscript) is that the behavior is linearly related to the image velocity – or at least that the two can be swapped in the analysis. It would be good to clarify this issue somewhere around the second paragraph of the subsection “Third-order kernel improves velocity estimation in moving natural scenes” where this analysis is described. At present, the text could be misread to indicate that image velocity is being estimated directly.

3) In the Discussion section:

- The interpretation of the main result (that the third-order term improves velocity estimation in natural scenes) feels a bit narrow; the authors should remedy that in the Discussion section.

- Ties to neural data need to be made a bit more extensive and detailed. Is there evidence to suggest that the kinds of optimal temporal filters derived here are found in the fly brain? Perhaps suggest a few other places to look, beyond what's already discussed (measuring T4/T5 kernels).

- Comment a bit more on the limitations/advantages of considering only rigid rotations applied to static natural or naturalistic scene stats. Perhaps also comment on using walking versus flight, and note that flies spend most of their time walking.

- Comment a bit more on the potential behavioral relevance of the magnitude of the effect of the improvement from third-order filter input to second-order processing.

- How prevalent is positive skewness in different natural scenes? Are there scenes/environments where one should expect more or less skewness? Would visual systems that operate in specialized niches be expected to have different nonlinearities in their motion processing as a result, or will all visual systems require some amount of compensation in their second-order motion detectors?

Title revisions:

Replace "perceptual" with "behavioral"

Suggested stylistic revisions:

The paper dives into contrast-polarity sensitivity right away, before sufficiently motivating the research approach as a whole. It also seems like this feature might require a more intuitive figure to explain the effect to the uninitiated. Perhaps a few more words about the behavioral advantages of this level of careful motion dissection would be useful.

In the last paragraph of the Introduction, it would be good to broaden the perspective a bit on how filters are usually computed in behavioral and neural recording experiments, how this relates to other filter-finding techniques, and what the third-order kernels might reveal. For a broad audience, it might also be good to remind the reader what a Volterra expansion is and add a note about adding kernels to the expansion versus static nonlinearities, along the lines discussed above.

Here and there the paper reads a bit too much like a travel journal – a few examples:

- The paragraphs in the subsection “The structure of natural scenes induces noise in second-order motion estimates” all have something like "We did something…Then we observed something, etc."

- The legend to Figure 3 is full of "we did this or that".

Consider revising the text to make the language just a bit more formal and deliberate in guiding the reader through the major scientific findings in the paper, by motivating and previewing the steps taken in this careful and thorough work.

There are a number of typos that need to be corrected, through a close reading of the text.

https://doi.org/10.7554/eLife.47579.sa1

Author response

Essential revisions:

1) As pointed out in the Discussion section, the Volterra kernel approximation represents a general and systematic approach to the identification of nonlinear systems (with practical caveats, as also mentioned). But there is also a more fundamental issue: Volterra kernels are essentially high dimensional polynomial approximations, and as such, although they are general, they are extremely inefficient in describing compressive nonlinearities (and biological systems are essentially always compressive in their typical range of operation). Motion estimation by correlation is a second-order Volterra operation, and therefore diverges quadratically. To counter that divergence, the system must somehow normalize the response to the contrast variance, such as in contrast normalization. And this is what seems to happen in biological motion detectors: It is well known that motion responses saturate as a function of contrast.

We agree that a major problem of quadratic motion estimators is that they are too responsive to large contrast stimuli. We also agree that contrast normalization is one popular approach to ameliorate this issue. We have added a citation to Simoncelli and Heeger to point out how contrast normalization has been applied to visual motion processing.

Our current approach leverages a simple algorithmic description to summarize multiple incompletely known nonlinear processing mechanisms, and remains agnostic to their origins in the circuit. The fly community is currently dissecting the biophysics of visual circuits in significant detail, and it is now clear that many distinct nonlinear mechanisms contribute to visual motion computation. As such, we agree that we’re studying phenomena that could conceivably relate to contrast normalization, but we think it is premature to conclude that contrast normalization is the origin of the effects we observe here.

Furthermore, typical contrast normalization methods generate contrast-symmetric nonlinear processing, for instance by normalizing by the variance of the inputs or by passing the (symmetric) motion signal itself through a normalizing or saturating process. In comparison to these forms of contrast-symmetric processing, the third-order kernel captured the contrast-asymmetric nonlinearity in the circuit, which is qualitatively different from standard contrast normalization.

We have addressed these issues through several changes in the manuscript. We made three changes to the Results section where the Volterra series expansion first appeared. First, we explicitly made the connection between the Volterra series and a polynomial expansion.

“Similar to the Taylor series from calculus, the Volterra series is a polynomial description of a nonlinearity, with a first-order kernel that describes linear transformations, a second-order kernel that captures quadratic terms, and higher-order kernels that combine to represent a wide variety of nonlinearities beyond the second-order.”
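For readers who want the expansion spelled out, the series referenced in this quoted passage has the following standard form for a single temporal input (notation ours; the manuscript's kernels are spatiotemporal):

```latex
r(t) = k_0 + \int k_1(\tau)\, s(t-\tau)\, d\tau
          + \iint k_2(\tau_1, \tau_2)\, s(t-\tau_1)\, s(t-\tau_2)\, d\tau_1\, d\tau_2
          + \iiint k_3(\tau_1, \tau_2, \tau_3)\, s(t-\tau_1)\, s(t-\tau_2)\, s(t-\tau_3)\, d\tau_1\, d\tau_2\, d\tau_3 + \cdots
```

Here $s$ is the stimulus contrast and $r$ the response. The second-order kernel $k_2$ captures pairwise correlations (the canonical correlator term), while $k_3$ measures sensitivity to triplet correlations.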

We then mention that the Volterra series might converge inefficiently to the true mechanistically accurate nonlinearity and note that there are convenient alternatives to our approach. Updated text:

“However, many polynomial terms can be needed to describe some nonlinearities. For instance, the polynomial description of a compressive, saturating nonlinearity is inefficient, and it can be easier to describe such transformations using alternative nonlinear model architectures, such as linear-nonlinear cascade models (Dayan et al., 2001).”

Finally, we pointed out that this expansion should be taken at the algorithmic, rather than the mechanistic level:

“We emphasize that the Volterra kernel description is explicitly algorithmic, as it aims to summarize the overall system processing without considering the mechanisms leading to this processing.”

This latter point permitted us to get into more details in the Discussion section. In particular, we added asymmetric contrast adaptation in the ON-OFF pathway as a potential nonlinearity that could lead to light-dark asymmetric motion processing. The text there was updated as follows:

“Yet it remains unclear whether asymmetric responses of T4 and T5 are inherited from upstream neurons. For instance, contrast adaptation could differ between the two pathways (Chichilnisky and Kalmar, 2002).”

In this light, the finding that the third-order Volterra kernel improves velocity estimates can then be simply seen as a corollary of the fact that the third-order kernel is essentially a "fitting parameter" which is necessarily better than a fit to second-order.

We do not think that this assertion is correct. It is true that one will better approximate a system by adding more fitting parameters, and we indeed better predicted optomotor turning responses with the added third-order kernel. However, we fitted a model to explain optomotor turning behavior, but we evaluated the model’s performance on a different task, i.e., velocity estimation. There is no guarantee that the system responsible for optomotor behavior is also optimized for velocity estimation, nor that including a third-order Volterra term in the fit would improve velocity estimates in natural scenes.

When we evaluated the model fitted to optomotor behavior with the task of velocity estimation in natural scenes, we had three implicit hypotheses. 1. We hypothesize that the visual system is optimized for velocity estimation. 2. We hypothesize that the turning response is a readout of that velocity estimate. 3. We hypothesize that the system is tailored specifically to the idiosyncrasies of natural scenes, rather than working for arbitrary image ensembles. The results are consistent with our hypotheses, and our finding that the third-order fit does improve natural scene performance is an exciting non-trivial result of this paper. It is not a corollary to fitting a model with more parameters.

We first added an introductory paragraph for Figure 3 in the Results section to explicitly state our assumption that optomotor behavior is a proxy for the fly’s internal estimate of the image velocity:

“The kernels were fit to turning behavior, so the output of the model to moving visual stimuli is the predicted optomotor turning response. […] Using the fitted behavioral model, we could thus investigate how accurately the fly’s velocity estimate tracks the true image velocity.”

We then elaborated our results of improving velocity estimation with the third-order kernel in the Discussion section:

“In this study, we first fit a Volterra series expansion to model the fly’s turning behavior in response to binary stochastic stimuli, and both second- and third-order terms in the Volterra series contributed to the turning behavior. […] Therefore, these results can be taken together to motivate the hypothesis that the magnitude of the fly’s turning response is determined by an internal estimate of velocity.”

Alternatively, as noted in the paper, the third-order behavior could be due to more or less fixed asymmetries in processing, or even due to nonlinear behavior in early vision. Something along these lines was proposed by Bulthoff and Goetz (Bulthoff and Goetz, 1979) to explain a motion illusion in human and fly vision (and that should be cited in this paper).

We agree that it’s interesting to ask how nonlinear circuits could develop sensitivity to third-order motion cues. In previous work, we have investigated a variety of circuit nonlinearities that could give rise to third-order motion processing (Fitzgerald and Clark, 2015), including contrast asymmetric front-end nonlinearities of the type suggested by Bulthoff and Goetz. Thank you for pointing us to this reference. We have now cited it. Our text around this issue now reads:

“Previous work has suggested that front-end nonlinearities could account for certain optomotor illusions in flies (Bülthoff and Götz, 1979), and it is conceivable that such nonlinearities could generate contrast asymmetric motion responses (Clark and Demb, 2016; Fitzgerald et al., 2011). […] Alternatively, nonlinear processing at the level of direction-selective T4 and T5 neurons could also generate the asymmetries we observed here.”

These points should be addressed fully in the text of the paper, both in the Results and in the Discussion.

We hope that you will find that these modifications have improved the paper and have fully addressed the concerns of the reviewers.

2) Response predictions and quantification. The kernel approach estimates behavioral responses, but the predictions are then compared directly to image velocity. Implicit in this (at least not clearly stated in the text of the manuscript) is that the behavior is linearly related to the image velocity – or at least that the two can be swapped in the analysis. It would be good to clarify this issue somewhere around the second paragraph of the subsection “Third-order kernel improves velocity estimation in moving natural scenes” where this analysis is described. At present, the text could be misread to indicate that image velocity is being estimated directly.

Thank you for pointing out that this issue was unclear in our manuscript. First, in line with comment 1, we extended the introduction to Figure 3 to state explicitly that our model was fitted to behavioral responses but used to predict image velocity. Second, we have added a few sentences to clarify the role of the linear assumption in our chosen accuracy metric. In particular, before presenting the results of Figure 3C, we now say:

“We quantified how well the model’s responses predicted the image velocity using the Pearson correlation coefficient. This metric supposes that the model response and image velocity are linearly related, and its value summarizes intuitively the mean-squared-error of the best linear fit between the model’s output and the image velocity.”
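The metric described in this quoted passage is straightforward to compute; a minimal sketch with synthetic data (illustrative numbers and units, not the paper's stimuli or fitted model):

```python
import numpy as np

def velocity_estimation_accuracy(model_response, image_velocity):
    """Pearson correlation between model output and true image velocity.
    Its square equals the fraction of velocity variance explained by the
    best linear fit of the model output to the velocity."""
    return float(np.corrcoef(model_response, image_velocity)[0, 1])

# Illustrative: a noisy, linearly related model output scores high even
# though its units differ from velocity (the metric is scale-invariant).
rng = np.random.default_rng(1)
velocity = rng.uniform(-100.0, 100.0, size=500)          # deg/s, hypothetical
response = 0.02 * velocity + 0.1 * rng.standard_normal(500)
r = velocity_estimation_accuracy(response, velocity)
```

Because the Pearson coefficient is invariant to linear rescaling of either variable, it sidesteps the unknown gain between turning magnitude and internal velocity estimate, which is exactly the linear assumption the reviewers asked about.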

3) In the Discussion section:

- The interpretation of the main result (that the third-order term improves velocity estimation in natural scenes) feels a bit narrow; the authors should remedy that in the Discussion section.

Thank you for these suggestions. To address this comment, we first added a summary Discussion paragraph to broaden our interpretation of the main result:

“In this study, we first fit a Volterra series expansion to model the fly’s turning behavior in response to binary stochastic stimuli, and both second- and third-order terms in the Volterra series contributed to the turning behavior. […] Since skewed scenes are prevalent across natural environments, and many visual systems exhibit ON-OFF asymmetric visual processing (Chichilnisky and Kalmar, 2002; Jin et al., 2011; Mazade et al., 2019; Pandarinath et al., 2010; Ratliff et al., 2010; Zaghloul et al., 2003), many animals are likely to use similar strategies for motion perception.”

We also related our work to other fields. We first discussed how the methodology of Volterra series expansions is used in other areas of neuroscience:

“Polynomial approximations to complex nonlinear systems have also been useful in other domains of neuroscience. […] These motifs might relate to measurable properties of the neocortex (Song et al., 2005).”

We then discussed how ON-OFF asymmetric processing in other visual systems is connected to natural scenes statistics:

“This resonates with previous work showing that many visual systems process ON and OFF signals asymmetrically to improve other aspects of visual processing (Kremkow et al., 2014; Mazade et al., 2019; Pandarinath et al., 2010; Ratliff et al., 2010). […] ON-OFF asymmetric visual processing also varies in other ways, and there is evidence that contrast adaptation in ON and OFF pathways is different between primate and salamander retinas (Chander and Chichilnisky, 2001).”

Finally, we discuss why modeling behavior at the algorithmic level is beneficial:

“David Marr famously asserted that neural computation needs to be understood at both the algorithmic and implementational levels (Marr and Poggio, 1976). […] Instead of algorithm and mechanism providing parallel or hierarchical goals, they should be treated as parts of one integrated understanding of the circuit.”

- Ties to neural data need to be made a bit more extensive and detailed. Is there evidence to suggest that the kinds of optimal temporal filters derived here are found in the fly brain? Perhaps suggest a few other places to look, beyond what's already discussed (measuring T4/T5 kernels).

To tie the measured behavioral kernel to neural data, we have added a Discussion section on potential mechanisms for the measured asymmetries. We first point out that asymmetric front-end filtering could, but would not necessarily, result in the measured third-order kernel. We then discuss how asymmetric processing could originate upstream of T4/T5, within T4/T5, or from differential weighting of T4 and T5 outputs downstream.

“Indeed, differentially affecting T4 and T5 activity, either through direct silencing or by manipulating upstream neurons, alters the behavioral responses of flies to triplet correlations (Clark et al., 2014; Leonhardt et al., 2016), and parallel experiments in humans similarly find that contrast-asymmetric responses are mediated by neurons separately modulated by moving ON and OFF edges (Clark et al., 2014). […] By measuring behavior and distilling the abstract algorithmic properties of the system, we will be able to constrain the contributions of individual circuit components without confining ourselves to an overly narrow class of mechanistic models.”

- Comment a bit more on the limitations/advantages of considering only rigid rotations applied to static natural or naturalistic scene stats. Perhaps also comment on using walking versus flight, and note that flies spend most of their time walking.

Thank you for this suggestion. We now comment on the limitations and advantages of these rigid rotations when we introduce them in Figure 1:

“This rigid translation of images mimics the motion produced by an animal’s pure rotation, during which visual objects all move at the same rotational velocity and occlusion does not change over time. Real motion through an environment generates more complex signals than this, but rigid translations are straightforward to compute and rotational visual stimuli are known to induce the rotational optomotor response that we focus on in this manuscript.”
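The rigid rotation described in this quoted passage amounts to a circular shift of a panoramic image, which is why all objects share one rotational velocity and occlusion never changes. A minimal sketch (hypothetical image dimensions and velocity; not the authors' stimulus code):

```python
import numpy as np

def rotate_scene(image, velocity_deg_per_s, dt, pixels_per_degree):
    """Rigidly rotate a panoramic image: every column shifts by the same
    amount and wraps around, so no occlusion relationships change.
    The shift is rounded to whole pixels for simplicity."""
    shift = int(round(velocity_deg_per_s * dt * pixels_per_degree))
    return np.roll(image, shift, axis=-1)          # wrap around the panorama

panorama = np.random.default_rng(2).random((10, 360))   # 1 pixel per degree
frame = rotate_scene(panorama, velocity_deg_per_s=90.0, dt=0.1, pixels_per_degree=1)
```

Real translation through a scene would instead produce depth-dependent parallax and changing occlusion, which is the limitation the added manuscript text acknowledges.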

We also note at the outset that Drosophila spends a large fraction of its life walking:

“Flies spend a large portion of their lives standing and walking on surfaces, making walking optomotor responses ethologically critical (Carey et al., 2006).”

- Comment a bit more on the potential behavioral relevance of the magnitude of the effect of the improvement from third-order filter input to second-order processing.

We added a Discussion section to address this suggestion. We started by comparing the behavioral magnitude of responses to third-order glider stimuli with the computational benefits of the third-order kernel; we then compared the magnitude of the improvement added by the third-order kernel with that of other mechanisms, such as spatial averaging; and we finally pointed out that the third-order kernel may underestimate the full improvement provided by its underlying mechanisms. This new section reads as follows:

“The HRC has explained a large number of phenomena and neural responses, and it is reasonable to ask how much we’ve gained by extending its second-order algorithmic description of fly behavior to a third-order one. […] It will be interesting to investigate whether mechanistically accurate models that explain the origin of the third-order kernel also reveal larger performance improvements.”

- How prevalent is positive skewness in different natural scenes? Are there scenes/environments where one should expect more or less skewness? Would visual systems that operate in specialized niches be expected to have different nonlinearities in their motion processing as a result, or will all visual systems require some amount of compensation in their second-order motion detectors?

Thank you for these interesting questions. There are studies showing that contrast adaptation in ON and OFF pathways in primates differs from that in salamanders; these differences could result from different visual environments. We have now added a section that discusses natural scene statistics in different habitats, and how natural environments could combine with early processing to shape the strategy of motion computation. This new section reads as follows:

“Adaptations to differences in both habitat and early sensory processing could potentially explain these divergences. […] These possibilities are not mutually exclusive, and in both cases, the early visual processing must work in concert with the downstream motion detectors to form robust and consistent perceptions.”
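The skewness statistic at issue in this exchange can be computed per scene; a minimal sketch with synthetic luminance values (a log-normal toy model of scene intensities, not measured natural scenes):

```python
import numpy as np

def contrast_skewness(luminance):
    """Skewness of a scene's contrast distribution. Positive skew means a
    long bright tail: a few very bright points against many darker pixels,
    the light-dark asymmetry typical of natural scenes."""
    contrast = luminance / luminance.mean() - 1.0   # Weber contrast about the mean
    z = (contrast - contrast.mean()) / contrast.std()
    return float(np.mean(z ** 3))

# Log-normal intensities are a common toy model for natural-scene luminance
# and are positively skewed by construction.
rng = np.random.default_rng(3)
scene = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)
skew = contrast_skewness(scene)                     # positive for this ensemble
```

Computing this statistic across scene databases from different habitats would be one direct way to address the reviewers' question about where more or less skewness should be expected.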

Title revisions:

Replace "perceptual" with "behavioral"

Thank you for this suggestion. We had many discussions about how well ‘perceptual’ fit here. We have now revised our title to read: “Asymmetric ON-OFF processing of visual motion cancels variability induced by the structure of natural scenes”.

Suggested stylistic revisions:

The paper dives into contrast-polarity sensitivity right away, before sufficiently motivating the research approach as a whole. It also seems like this feature might require a more intuitive figure to explain the effect to the uninitiated. Perhaps a few more words about the behavioral advantages of this level of careful motion dissection would be useful.

Thank you for this suggestion. We now spend time in the Introduction motivating the systems analysis and the study of motion computation in terms of spatiotemporal correlations before we introduce contrast-polarity sensitivity:

“Higher-order correlations could also contribute to motion computation, and Bayes optimal visual motion estimators can be written as a sum of terms specialized for detecting different correlation types (Poggio and Reichardt, 1973; Potters and Bialek, 1994).[…] This sensitivity to triplet spatiotemporal correlations shows that motion perception incorporates cues neglected by canonical motion detectors.”

In the last paragraph of the Introduction, it would be good to broaden the perspective a bit on how filters are usually computed in behavioral and neural recording experiments, how this relates to other filter-finding techniques, and what the third-order kernels might reveal. For a broad audience, it might also be good to remind the reader what a Volterra expansion is and add a note about adding kernels to the expansion versus static nonlinearities, along the lines discussed above.

Thank you for this suggestion. In line with previous comments, we added a brief introduction to Volterra expansions in the Introduction section. In line with Essential Revision 1, in the Results section, we added a more detailed explanation of the Volterra series expansion and related it to other modeling techniques for nonlinear systems. We also added prose to motivate why we chose the Volterra expansion for this study and what the third-order kernel might reveal:

“Volterra kernels are useful for studying visual motion processing because they allow us to rigorously group response properties by their order (Fitzgerald et al., 2011; Potters and Bialek, 1994), thereby permitting us to clearly describe both canonical and contrast polarity-dependent components of the behavior. […] The third-order kernel directly measures sensitivities to triplet spatiotemporal correlations and probes ON/OFF asymmetries in motion processing.”

Here and there the paper reads a bit too much like a travel journal – a few examples:

- The paragraphs in the subsection “The structure of natural scenes induces noise in second-order motion estimates” all have something like "We did something…Then we observed something, etc."

Thank you. We have revised to add motivation throughout.

- The legend to Figure 3 is full of "we did this or that".

The legend has been updated to avoid this. Thank you.

Consider revising the text to make the language just a bit more formal and deliberate in guiding the reader through the major scientific findings in the paper, by motivating and previewing the steps taken in this careful and thorough work.

We have revised the text throughout to eliminate some of this informality and to better motivate the analyses we performed.

There are a number of typos that need to be corrected, through a close reading of the text.

As we revised and re-read the manuscript, we have corrected typos we found. We hope we caught them all.

https://doi.org/10.7554/eLife.47579.sa2

Article and author information

Author details

  1. Juyue Chen

    Interdepartmental Neuroscience Program, Yale University, New Haven, United States
    Contribution
    Conceptualization, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0003-2176-4723
  2. Holly B Mandel

    Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, United States
    Contribution
    Conceptualization, Software, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
  3. James E Fitzgerald

    Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, United States
    Contribution
    Conceptualization, Formal analysis, Supervision, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    Contributed equally with
    Damon A Clark
    For correspondence
    fitzgeraldj@janelia.hhmi.org
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-0949-4188
  4. Damon A Clark

    1. Interdepartmental Neuroscience Program, Yale University, New Haven, United States
    2. Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, United States
    3. Department of Physics, Yale University, New Haven, United States
    4. Department of Neuroscience, Yale University, New Haven, United States
    Contribution
    Conceptualization, Supervision, Funding acquisition, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    Contributed equally with
    James E Fitzgerald
    For correspondence
    damon.clark@yale.edu
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0001-8487-700X

Funding

National Institutes of Health (R01EY026555)

  • Juyue Chen
  • Damon A Clark

National Science Foundation (IOS1558103)

  • Juyue Chen
  • Damon A Clark

Chicago Community Trust (Searle Scholar Award)

  • Holly B Mandel
  • Damon A Clark

Howard Hughes Medical Institute

  • James E Fitzgerald

Alfred P. Sloan Foundation (Research Fellowship)

  • Damon A Clark

The Swartz Foundation

  • James E Fitzgerald

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors thank Ann Hermundstad and Emily Cooper for comments on the manuscript. JEF was supported by the Swartz Foundation and the Howard Hughes Medical Institute. DAC and this research were supported by NIH R01EY026555, NIH P30EY026878, NSF IOS1558103, a Searle Scholar Award, a Sloan Fellowship in Neuroscience, the Smith Family Foundation, and the E Matilda Ziegler Foundation.

Senior Editor

  1. Ronald L Calabrese, Emory University, United States

Reviewing Editor

  1. Stephanie Palmer, University of Chicago, United States

Reviewer

  1. Fred Rieke, University of Washington, United States

Publication history

  1. Received: April 10, 2019
  2. Accepted: October 12, 2019
  3. Accepted Manuscript published: October 15, 2019 (version 1)
  4. Version of Record published: November 29, 2019 (version 2)
  5. Version of Record updated: December 5, 2019 (version 3)

Copyright

© 2019, Chen et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,071
    Page views
  • 204
    Downloads
  • 2
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
