Variance predicts salience in central sensory processing

Version of Record: December 22, 2014
Accepted Manuscript: November 14, 2014

1. Built upon by
Rat sensitivity to multipoint statistics is predicted by efficient coding of natural scenes

Riccardo Caramellino, Eugenio Piasini ... Davide Zoccolan

Research Advance Dec 7, 2021

Abstract

Information processing in the sensory periphery is shaped by natural stimulus statistics. In the periphery, a transmission bottleneck constrains performance; thus efficient coding implies that natural signal components with a predictably wider range should be compressed. In a different regime—when sampling limitations constrain performance—efficient coding implies that more resources should be allocated to informative features that are more variable. We propose that this regime is relevant for sensory cortex when it extracts complex features from limited numbers of sensory samples. To test this prediction, we use central visual processing as a model: we show that visual sensitivity for local multi-point spatial correlations, described by dozens of independently-measured parameters, can be quantitatively predicted from the structure of natural images. This suggests that efficient coding applies centrally, where it extends to higher-order sensory features and operates in a regime in which sensitivity increases with feature variability.

https://doi.org/10.7554/eLife.03722.001

eLife digest

Our senses are constantly bombarded by sights and sounds, but the capacity of the brain to process all these inputs is finite. The stimuli that contain the most useful information must therefore be prioritized for processing by the brain to ensure that we build up as complete a picture as possible of the world around us. However, the strategy that the brain uses to select certain stimuli—or certain features of stimuli—for processing at the expense of others is unclear.

Hermundstad et al. have now provided new insights into this process by analyzing how humans respond to artificial stimuli that contain controllable mixtures of features that are found in natural stimuli. To do this, Hermundstad et al. selected photographs of the natural world, and measured the brightness of individual pixels. After adjusting images in a way that mimics the human retina, the brightest 50% of the pixels in each photograph were colored white and the remaining 50% were colored black.

Hermundstad et al. then used statistical techniques to calculate the degree to which the color of pixels could be used to predict the color of their neighbors. In this way, it was possible to calculate the amount of variation throughout the images, and then make computer-generated images in which pixel colorings were more or less predictable than in the natural images.

Volunteers then performed a task in which they had to locate a computer-generated pattern against a background of random noise. The volunteers were able to locate this target most easily when it contained the same kinds of patterns and features that were informative about natural images.

While this shows that the brain is adapted to prioritize features that are more informative about the natural world, understanding exactly how the brain implements this strategy remains a challenge.

https://doi.org/10.7554/eLife.03722.002

Introduction

Sensory receptor neurons encode signals from the environment, which are then transformed by successive neural layers to support diverse and computationally complex cognitive tasks. A normative understanding of these computations begins in the periphery, where the efficient coding principle—the notion that a sensory system is tuned to the statistics of its natural inputs—has been shown to be a powerful organizing framework (Barlow, 2001; Simoncelli, 2002). Perhaps the best-known example is that of redundancy removal via predictive coding and spatiotemporal decorrelation. In insects, this is carried out by neural processing (Laughlin, 1981; van Hateren, 1992b); in vertebrates, fixational eye movements—which precede the first step of neural processing (Srinivasan et al., 1982; Atick and Redlich, 1990; Atick et al., 1992)—play a major role (Kuang et al., 2012). This approach was later extended to describe population coding, retinal mosaic structure (Barlow, 2001; Karklin and Simoncelli, 2001; Borghuis et al., 2008; Balasubramanian and Sterling, 2009; Liu et al., 2009; Garrigan et al., 2010; Ratliff et al., 2010; Kuang et al., 2012), adaptation of neural responses (Brenner et al., 2000; Fairhall et al., 2001; Schwartz and Simoncelli, 2001), and early auditory processing (Smith and Lewicki, 2006). Taken together, normative theories based on efficient coding have been successful in explaining aspects of processing in the sensory periphery that are tuned to simple statistical features of the natural world.

Can we extend such theories beyond the sensory periphery to describe cortical sensitivity to complex sensory features? Normative theories have been successful in predicting the response properties of single cells, including receptive fields in V1 (Olshausen and Field, 1996; Bell and Sejnowski, 1997; van Hateren and Ruderman, 1998; van Hateren and van der Schaaf, 1998; Hyvarinen and Hoyer, 2000; Vinje and Gallant, 2000; Karklin and Lewicki, 2009) and spectro-temporal receptive fields in primary auditory cortex (Carlson and DeWeese, 2002, 2012), as well as distributions of tuning curves across individual cells in a population (Lewicki, 2002; Ganguli and Simoncelli, 2011). Some complex features, however, might not be represented by the tuning properties of individual cells in any direct way, but rather emerge from the collective behavior of many cells. Instead of trying to predict individual cell properties, we therefore focus on the sensitivity of the complete neural population. Is there an organizing principle that determines how resources within the population are allocated to representing such complex features?

When the presence of complex features is predictable (i.e., can be accurately guessed from simpler features along with priors about the environment), mechanisms are best devoted elsewhere (see Discussion; van Hateren, 1992a). In contrast, sensory features that are highly variable and not predictable from simpler ones can serve to determine their causes (e.g., to distinguish among materials or objects), a first step in guiding decisions. We will show that these ideas predict a specific organizing principle for aggregate sensitivities arising in cortex: the perceptual salience of complex sensory signals increases with the variability, or unpredictability, of the corresponding signals over the ensemble of natural stimuli.

To test this hypothesis, we focus on early stages of central visual processing. Here, early visual cortex (V1 and V2) is charged with extracting edges, shapes, and other complex correlations of light between multiple points in space (Morrone and Burr, 1988; Oppenheim and Lim, 1981; von der Heydt et al., 1984). We compare the spatial variation of local patterns of light across natural images with human sensitivity to manipulations of the same patterns in synthetic images. This allows us to determine how sensitivity is distributed across many different features, rather than simply determining the most salient ones. (We will say that a feature is more salient if it is more easily discriminated from white noise.) To this end, we parametrize the space of local multi-point correlations in images in terms of a complete set of coordinates, and we measure the probability distribution of coordinate values sampled over a large ensemble of natural scenes. We then use a psychophysical discrimination task to measure human sensitivity to the same correlations in synthetic images, where the correlations can be isolated and manipulated in a mathematically rigorous fashion by varying the corresponding coordinates (Chubb et al., 2004; Victor et al., 2005; Victor and Conte, 2012; Victor et al., 2013). Comparing the measurements, we show that human sensitivity to these multi-point elements of visual form is tuned to their variation in the natural world. Our result supports a broad hypothesis: cortex invests preferentially in mechanisms that encode unpredictable sensory features that are more variable, and thus more informative about the world. Namely, variance is salience.

Results

As we recently showed, some informative local correlations of natural scenes are captured by the configurations of luminances seen through a ‘glider’, that is, a window defined by a 2 × 2 square arrangement of pixels (Tkačik et al., 2010). We use this observation first as a framework for analyzing the local statistical structure of natural scenes, then to characterize psychophysical sensitivities via a set of synthetic visual texture stimuli, and finally to compare the two.

Analyzing local image statistics in natural scenes

The analysis of natural scenes is schematized in Figure 1. We collect an ensemble of image patches from the calibrated Penn natural image database (PIDB) (Tkačik et al., 2011). We preprocess the image patches as shown in Figure 1A. This involves first averaging pixel luminances over a square region of N × N pixels, which converts an image of size L₁ × L₂ pixels into an image of reduced size L₁/N × L₂/N pixels. Images are then divided into R × R square patches of these downsampled pixels and whitened (see ‘Materials and methods’, Image preprocessing, for further details). Since the preprocessing depends on a choice of two parameters, the block-average factor N and patch size R, we report results for multiple image analyses performed using the identical preprocessing pipeline but for various choices of N and R. After preprocessing, we binarize each patch to have equal numbers of black and white pixels (black = −1, white = +1). We characterize each patch by the histogram of 16 binary colorings ($2^{2 \times 2}$) seen through a square 2 × 2 pixel glider (Figure 1B). Translation invariance imposes constraints on this histogram, reducing the number of degrees of freedom to 10 (Victor and Conte, 2012). These degrees of freedom can be mapped to a set of image statistic coordinates that separates correlations based on their order: (i) one first-order coordinate, γ, describes overall luminance, (ii) four second-order coordinates, $\{β_{|}, β_{-}, β_{/}, β_{\backslash}\}$, describe two-point correlations between pixels arranged vertically, horizontally, or diagonally, (iii) four third-order coordinates, $\{θ_{⌞}, θ_{⌜}, θ_{⌝}, θ_{⌟}\}$, describe three-point correlations between pixels arranged into ⌞-shapes of different orientations, and (iv) one fourth-order coordinate, α, describes the single four-point correlation between all four pixels in the glider (Figure 1C). The binarization step of the preprocessing pipeline forces γ to zero, leaving nine coordinates.
Each image patch is thus characterized by a vector of coordinate values $\{β_{|}, β_{-}, β_{/}, β_{\backslash}, θ_{⌜}, θ_{⌝}, θ_{⌟}, θ_{⌞}, α\}$, that is, a point within the multidimensional space of image statistics. Accumulating these points across patches yields a multidimensional probability distribution that characterizes the local correlations in natural scenes (schematized in Figure 1D). A total of 724 images (up to 249,780 patches, depending on the choice of N and R) were used to construct this distribution.
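
The coordinate extraction described above can be sketched in code. This is a minimal illustration, assuming that each coordinate is the average product of ±1 pixel values over all positions of the 2 × 2 glider (consistent with the coordinates being sums and differences of the 16 coloring probabilities). The function names `block_average`, `binarize`, and `glider_coordinates` are hypothetical, the whitening step is omitted, and the sign and naming conventions may differ from those of Victor and Conte (2012).

```python
import numpy as np

def block_average(image, n):
    """Average luminance over non-overlapping n x n blocks."""
    h, w = image.shape
    h, w = h - h % n, w - w % n                      # crop so both sides divide by n
    return image[:h, :w].reshape(h // n, n, w // n, n).mean(axis=(1, 3))

def binarize(patch):
    """Threshold about the median: white = +1, black = -1."""
    return np.where(patch > np.median(patch), 1, -1)

def glider_coordinates(binary_patch):
    """Average +/-1 pixel products over all positions of a 2 x 2 glider."""
    tl, tr = binary_patch[:-1, :-1], binary_patch[:-1, 1:]   # top-left, top-right
    bl, br = binary_patch[1:, :-1], binary_patch[1:, 1:]     # bottom-left, bottom-right
    return {
        "gamma": binary_patch.mean(),
        "beta_horizontal": (tl * tr).mean(),
        "beta_vertical": (tl * bl).mean(),
        "beta_diag_down": (tl * br).mean(),
        "beta_diag_up": (tr * bl).mean(),
        # Third-order terms, labeled by the glider pixel that is left out.
        "theta_no_br": (tl * tr * bl).mean(),
        "theta_no_bl": (tl * tr * br).mean(),
        "theta_no_tr": (tl * bl * br).mean(),
        "theta_no_tl": (tr * bl * br).mean(),
        "alpha": (tl * tr * bl * br).mean(),
    }

# Illustrative run on random luminances (not natural-image data).
rng = np.random.default_rng(0)
image = rng.random((128, 128))
patch = block_average(image, 2)[:32, :32]            # N = 2, R = 32
coords = glider_coordinates(binarize(patch))
```

In this sketch, median binarization of a patch with an even number of distinct pixel values drives `gamma` to zero, which mirrors the reduction from ten coordinates to nine.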

Figure 1 (with 3 supplements)

Extracting image statistics from natural scenes.

(A) We first block-average each image over N × N pixel squares, then divide it into patches of size R × R pixels, then whiten the ensemble of patches by removing the average pairwise structure, and finally binarize each patch about its median intensity value (see ‘Materials and methods’, Image preprocessing). (B) From each binary patch, we measure the occurrence probability of the 16 possible colorings as seen through a two-by-two pixel glider (red). Translation invariance imposes constraints between the probabilities that reduce the number of degrees of freedom to 10. (C) A convenient coordinate basis for these 10 degrees of freedom can be described in terms of correlations between pixels as seen through the glider. These consist of one first-order coordinate (γ), four second-order coordinates ($β_{|}, β_{-}, β_{/}, β_{\backslash}$), four third-order coordinates ($θ_{⌞}, θ_{⌜}, θ_{⌝}, θ_{⌟}$), and one fourth-order coordinate (α). Since the images are binary, with black = −1 and white = +1, these correlations are sums and differences of the 16 probabilities that form the histogram in panel B (Victor and Conte, 2012). (D) Each patch is assigned a vector of coordinate values that describes the histogram shown in (B). This coordinate vector defines a specific location in the multidimensional space of image statistics. The ensemble of patches is then described by the probability distribution of coordinate values. We compute the degree of variation (standard deviation) along different directions within this distribution (inset). (E) Along single coordinate axes, we find that the degree of variation is rank-ordered as $\{β_{|}, β_{-}\} > \{β_{/}, β_{\backslash}\} > α > \{θ_{⌞}, θ_{⌜}, θ_{⌝}, θ_{⌟}\}$, shown separately for different choices of the block-average factor N and patch size R used during image preprocessing.

https://doi.org/10.7554/eLife.03722.003

To summarize this distribution, we compute the degree of variation (standard deviation) along each coordinate axis (Figure 1E). As is shown, the degree of variation along different coordinate axes exhibits a characteristic rank-ordering, given by $\{β_{|}, β_{-}\} > \{β_{/}, β_{\backslash}\} > α > \{θ_{⌞}, θ_{⌜}, θ_{⌝}, θ_{⌟}\}$; that is, the most variable correlations are pairwise correlations in the cardinal directions, followed by pairwise correlations in the oblique directions, followed by fourth-order correlations. Interestingly, third-order correlations are the least variable across image patches. An analogous analysis performed on white noise yields a flat distribution with considerably smaller standard deviation values (see ‘Materials and methods’, Analysis variants for Penn Natural Image Database, and Figure 1—figure supplement 3 for comparison), and performing the analysis on colored Gaussian noise (e.g., with a $1 / f^{k}$ spectrum) would also yield a flat distribution because of the whitening stage in the image preprocessing pipeline. These (and subsequent) findings are preserved across different choices of image analysis parameters (shown in Figure 1E for block-average factors N = 2, 4 and patch sizes R = 32, 48, 64; see ‘Materials and methods’, Analysis variants for Penn Natural Image Database, and Figure 3—figure supplement 5A for a larger set of parameters) and also across other collections of natural images (see ‘Materials and methods’, Comparison with van Hateren Database, and Figure 3—figure supplement 5B for a parallel analysis of the van Hateren image dataset (van Hateren and van der Schaaf, 1998), which gives similar results).
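
The flat white-noise baseline mentioned above can be illustrated numerically. The sketch below is not the analysis used in the study: it assumes independent ±1 pixels (no whitening and no median constraint) and takes each coordinate to be the average pixel product over the glider positions in an R × R patch. Under those assumptions the products at different positions are uncorrelated, so every coordinate should have a standard deviation close to 1/(R − 1), regardless of its order.

```python
import numpy as np

rng = np.random.default_rng(0)
R, n_patches = 32, 2000
patches = rng.choice([-1, 1], size=(n_patches, R, R))    # i.i.d. binary white noise

# The four pixels of the glider at every position in every patch.
tl, tr = patches[:, :-1, :-1], patches[:, :-1, 1:]
bl, br = patches[:, 1:, :-1], patches[:, 1:, 1:]

products = {
    "beta_horizontal": tl * tr,
    "beta_vertical": tl * bl,
    "beta_diagonal": tl * br,
    "theta": tl * tr * bl,
    "alpha": tl * tr * bl * br,
}

# One coordinate value per patch, then the spread of that value across patches.
stds = {name: p.mean(axis=(1, 2)).std() for name, p in products.items()}
expected = 1.0 / (R - 1)     # 1 / sqrt(number of glider positions)

for name, s in stds.items():
    print(f"{name:16s} std = {s:.4f} (white-noise expectation {expected:.4f})")
```

Any rank-ordering across coordinates, such as the one reported for natural scenes, is therefore a departure from this flat baseline.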

Characterizing visual sensitivity to local image statistics

To characterize perceptual sensitivity to different statistics, we isolated them in synthetic visual images and used a figure/ground segmentation task (Figure 2B). We used a four-alternative forced-choice task in which stimuli consisted of a textured target and a binary noise background (or vice-versa). Each stimulus was presented for 120 ms and was followed by a noise mask. Subjects were then asked to identify the spatial location (top, bottom, left, or right) of the target. Experiments were carried out for synthetic stimuli in which the target or background was defined by first varying image statistic coordinates independently (Figure 2A shows examples of gamuts from which stimuli are built). Along each coordinate axis, threshold (1/sensitivity) was defined as the coordinate value required to support a criterion level of performance (Figure 2C, inset). We then performed further experiments in which the target or background was defined by simultaneously varying pairs of coordinates. For measurements involving each coordinate pair (to which we will refer as a ‘coordinate plane’), we traced out an isodiscrimination contour (Figure 2C) that describes the threshold values not only along the cardinal directions, but also along oblique directions. Measurements were collected for four individual subjects in each of 11 distinct coordinate planes (representing all distinct coordinate pairs up to 4-fold rotational symmetry; see ‘Materials and methods’, Psychophysical methods, for further details). Each subject performed 4,320 judgements per plane, for a total of 47,520 trials per subject.
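
The threshold definition can be made concrete with a small sketch. In a four-alternative task chance performance is 0.25, so a criterion halfway between chance and perfect corresponds to a fraction correct of 0.625. The code below assumes a Weibull-shaped psychometric function, which is a common choice but is not specified in this section; the function names, the fixed slope, and the grid-search fit are all illustrative.

```python
import numpy as np

CHANCE = 0.25                       # four-alternative forced choice
CRITERION = (CHANCE + 1.0) / 2.0    # halfway between chance and perfect = 0.625

def fraction_correct(x, threshold, slope):
    """Assumed Weibull form: equals CRITERION exactly when x == threshold."""
    x = np.asarray(x, dtype=float)
    return CHANCE + (1.0 - CHANCE) * (1.0 - 2.0 ** (-((x / threshold) ** slope)))

def estimate_threshold(x, p_correct, slope=2.0):
    """Grid-search the threshold that minimizes squared error (illustrative fit)."""
    candidates = np.linspace(0.01, 1.0, 991)             # steps of 0.001
    errors = [np.sum((fraction_correct(x, t, slope) - p_correct) ** 2)
              for t in candidates]
    return float(candidates[int(np.argmin(errors))])

# Noise-free synthetic observations generated with a known threshold of 0.3.
levels = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
observed = fraction_correct(levels, threshold=0.3, slope=2.0)
threshold = estimate_threshold(levels, observed)
sensitivity = 1.0 / threshold
```

With real data the observed fractions would be noisy, and a maximum-likelihood fit with a free slope would usually be preferred over this grid search.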

Figure 2

Measuring human sensitivity to image statistics.

(A) Synthetic binary images can be created that contain specified values of individual image statistic coordinates (as shown here) or specified values of pairs of coordinates (Victor and Conte, 2012). (B) To measure human sensitivity to image statistics, we generate synthetic textures with prescribed coordinate values but no additional statistical structure, and we use these synthetic textures in a figure/ground segmentation task (see Victor and Conte, 2012 and ‘Materials and methods’, Psychophysical methods). (C) For measurements along coordinate axes, test stimuli are built out of homogeneous samples drawn from the gamuts shown in A (e.g., the target shown in B was generated from the portion of the gamut indicated by the red arrow in A; see ‘Materials and methods’, Psychophysical methods, and Victor et al., 2005; Victor and Conte, 2012; Victor et al., 2013). We assess the discriminability of these stimuli from white noise by measuring the threshold value of a coordinate required to achieve performance halfway between chance and perfect (inset). A similar approach is used to measure sensitivity in oblique directions; here, two coordinate values are specified to create the test stimuli. The threshold values along the axes and in oblique directions define an isodiscrimination contour (red dashed ellipse, main panel) in pairwise coordinate planes. (D) Along individual coordinate axes, we find that sensitivities (1/thresholds) are rank-ordered as $\{β_{|}, β_{-}\} > \{β_{/}, β_{\backslash}\} > α > \{θ_{⌜}, θ_{⌝}, θ_{⌟}, θ_{⌞}\}$, shown separately for four individual subjects. A single set of perceptual sensitivities is shown for $(β_{|}, β_{-})$, $(β_{/}, β_{\backslash})$, and $(θ_{⌞}, θ_{⌜}, θ_{⌝}, θ_{⌟})$, since human subjects are equally sensitive to rotationally-equivalent pairs of second-order coordinates and to all third-order coordinates (Victor et al., 2013).

https://doi.org/10.7554/eLife.03722.007

Figure 2D shows perceptual sensitivities measured along each coordinate axis. For each of four subjects, a similar pattern emerges for sensitivities as was observed for variation in natural image statistics: sensitivities are rank-ordered as $\{β_{|}, β_{-}\} > \{β_{/}, β_{\backslash}\} > α > \{θ_{⌜}, θ_{⌝}, θ_{⌟}, θ_{⌞}\}$.

Note that the difference between the sensitivities in the horizontal and vertical directions ($β_{-}$ and $β_{|}$) vs the diagonal directions ($β_{\backslash}$ and $β_{/}$) is not simply an ‘oblique effect’, that is, a greater sensitivity to cardinally- vs obliquely-oriented contours (Campbell et al., 1966). Horizontal and vertical pairwise correlations differ from the diagonal pairwise correlations in more than just orientation: pixels involved in horizontal and vertical pairwise correlations share an edge, while pixels involved in diagonal pairwise correlations only share a corner. Correspondingly, the difference in sensitivities for horizontal and vertical correlations vs diagonal correlations is approximately 50%, which is much larger than the size of the classical oblique effect (10–20%) (Campbell et al., 1966).

Natural scenes predict human sensitivity along single coordinates

Figures 1E and 2D show a rank-order correspondence between natural image statistics and perceptual sensitivities. This qualitative comparison can be converted to a quantitative one (Figure 3A), as a single scaling parameter aligns the standard deviation of natural image statistics with the corresponding perceptual sensitivities. In this procedure, each of the six image analyses is scaled by a single multiplicative factor that minimizes the squared error between the set of standard deviations and the set of subject-averaged sensitivities (see ‘Materials and methods’, Image preprocessing, and Figure 3—figure supplement 1 for additional details regarding scaling). The agreement is very good, with the mismatch between image analyses and human psychophysics comparable to the variability from one image analysis to another, or from one human subject to another.
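
The single multiplicative factor has a closed form. Minimizing the squared error between the scaled standard deviations s·σ and the sensitivities t gives s = (σ · t)/(σ · σ). The sketch below shows this under the assumption of an ordinary unweighted least-squares fit; the helper name `best_scale` and the numbers in the example are placeholders, not values from the study.

```python
import numpy as np

def best_scale(image_std, sensitivity):
    """Scale s minimizing sum((s * image_std - sensitivity) ** 2)."""
    image_std = np.asarray(image_std, dtype=float)
    sensitivity = np.asarray(sensitivity, dtype=float)
    return float(image_std @ sensitivity / (image_std @ image_std))

def squared_error(scale, image_std, sensitivity):
    """Mismatch between scaled image statistics and sensitivities."""
    diff = scale * np.asarray(image_std, dtype=float) - np.asarray(sensitivity, dtype=float)
    return float(diff @ diff)

# Placeholder numbers for four coordinate classes (cardinal beta, oblique beta,
# theta, alpha); they only illustrate the calculation.
std_placeholder = [0.10, 0.06, 0.02, 0.04]
sens_placeholder = [4.2, 2.3, 0.9, 1.5]
scale = best_scale(std_placeholder, sens_placeholder)
```

Because only one number is fitted per image analysis, the rank-ordering and the ratios between coordinates are unaffected by the scaling.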

Figure 3 (with 9 supplements)

Variation in natural images predicts human perceptual sensitivity.

(A) Scaled degree of variation (standard deviation) in natural image statistics along second- (β), third- (θ), and fourth-order (α) coordinate axes (blue circular markers) is shown in comparison to human perceptual sensitivities measured along the same coordinate axes (red square markers). Degree of variation in natural image statistics is separately shown for different choices of the block-average factor (N) and patch size (R) used during image preprocessing. Perceptual sensitivities are separately shown for four individual subjects. As in Figure 2D, a single set of perceptual sensitivities is shown for $\{β_{|}, β_{-}\}$, $\{β_{/}, β_{\backslash}\}$, and $\{θ_{⌞}, θ_{⌜}, θ_{⌝}, θ_{⌟}\}$. (B) For each pair of coordinates, we compare the precision matrix (blue ellipses) extracted from natural scenes (using N = 2, R = 32) to human perceptual isodiscrimination contours (red ellipses). Coordinate planes are organized into a grid. The set of ellipses in each pairwise plane is scaled to maximally fill each portion of the grid; agreement between the variation along single coordinate axes and the corresponding human sensitivities (shown in A) guarantees that no information is lost by scaling. Across all 36 coordinate planes, there is a correspondence in the shape, size, and orientation of precision matrix contours and perceptual isodiscrimination contours. (C) Quantitative comparison of a single image analysis (N = 2, R = 32) with the subject-averaged psychophysical data. For single coordinates depicted in A, we report the standard deviation in natural image statistics (upper row) and perceptual sensitivities (middle row). For sets of coordinate planes depicted in (B), we report the (average eccentricity, angular tilt) of precision matrix contours from natural scenes (upper row) and isodiscrimination contours from psychophysical measurements (middle row).
The degree of correspondence between predictions derived from natural image data and the psychophysical measurements can be conveniently summarized as a scalar product (see text), where 1 indicates a perfect match. In all cases, the correspondence is very high (0.938–0.999) and is highly statistically significant (p ≤ 0.0003 for both single coordinates and pairwise coordinate planes; see ‘Materials and methods’, Permutation tests, for details).

https://doi.org/10.7554/eLife.03722.008

We quantify the correspondence between image analyses and psychophysical analyses by computing the scalar product between the normalized vector of standard deviations (extracted separately from each image analysis) and the normalized vector of subject-averaged sensitivities (extracted from the set of psychophysical analyses). A value of 1 indicates perfect correspondence, and 0 indicates no correspondence. This value ranges from 0.987 to 0.999 across image analyses and is consistently larger than the value measured under the null hypothesis that the apparent correspondence between statistics and sensitivities is due to chance (p ≤ 0.0003 for each image analysis; see Tables 1–2 and ‘Materials and methods’, Permutation tests, for details regarding statistical tests).
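
The scalar product of two normalized vectors is their cosine similarity, which can be sketched as follows. The permutation test shown here is only one simple possibility, assuming that the null distribution is built by shuffling which coordinate each standard deviation is assigned to; the procedure actually used is described in ‘Materials and methods’, Permutation tests, and may differ. Function names are illustrative.

```python
import numpy as np

def correspondence(u, v):
    """Scalar product of the two vectors after normalizing each to unit length."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def permutation_p_value(u, v, n_perm=10000, seed=0):
    """Fraction of label shuffles whose correspondence is at least the observed one."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u, dtype=float)
    observed = correspondence(u, v)
    null = np.array([correspondence(rng.permutation(u), v) for _ in range(n_perm)])
    # Add-one correction so the estimated p-value is never exactly zero.
    return float((np.sum(null >= observed) + 1) / (n_perm + 1))
```

For vectors with non-negative entries, such as standard deviations and sensitivities, this measure lies between 0 and 1, matching the interpretation given above.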

These findings support our hypothesis that human perceptual sensitivity measured along single coordinate axes (assessed using synthetic binary textures) is predicted by the degree of variation along the same coordinate axes in natural scenes.

Natural scenes predict human sensitivity to joint variations of all pairs of coordinates

The correspondence shown in Figure 3A considers each image statistic coordinate in isolation. However, it is known that image statistics covary substantially in natural images (as diagrammed in Figure 1D) and also that they interact perceptually (as diagrammed in Figure 2C). When pairs of natural image statistics covary, thus sampling oblique directions not aligned with the coordinate axes in the space of image statistics, our hypothesis predicts that human perceptual sensitivity is matched to both the degree and the direction of that covariation (we are referring here to the orientation of a distribution in the coordinate plane of a pair of image statistics, and not to an orientation in physical space). To test this idea, we proceeded as follows.

First, we fit the distribution of image statistics with a multidimensional Gaussian. When projected into pairwise coordinate planes, the isoprobability contours of this Gaussian capture the in-plane shape and orientation of the covariation of the distribution. Along single coordinate axes, the variation in natural image statistics predicts human perceptual sensitivities, as we have shown (Figure 3A). More generally, we would predict that sensitivity should be high along directions in which the distribution of natural image statistics has high standard deviation, because in those directions, the position of a sample cannot be guessed. Within coordinate planes, the quantitative statement of this idea is that the inverse covariance matrix, or precision matrix, predicts perceptual isodiscrimination contours. Sensitivity is expected to be low (and therefore threshold high) along directions in which the precision matrix has a high value and the position of a sample can be guessed a priori.
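
The step from a Gaussian fit to a predicted threshold contour can be sketched as follows. The code assumes that projection into a coordinate plane means taking the marginal 2 × 2 covariance of the pair, and that the predicted threshold along a principal direction is proportional to the inverse of the standard deviation in that direction, that is, to the square root of the corresponding precision eigenvalue. The function name `plane_ellipse` and the eccentricity formula are illustrative and may not match the definitions in ‘Materials and methods’.

```python
import numpy as np

def plane_ellipse(samples, i, j):
    """Predicted threshold ellipse in the (i, j) plane of image statistics.

    samples has shape (n_patches, n_coordinates).
    Returns (precision, semi_axes, tilt_radians, eccentricity).
    """
    cov = np.cov(samples, rowvar=False)              # Gaussian fit: full covariance
    cov_ij = cov[np.ix_([i, j], [i, j])]             # marginal covariance of the pair
    precision = np.linalg.inv(cov_ij)
    eigvals, eigvecs = np.linalg.eigh(precision)     # eigenvalues in ascending order
    semi_axes = np.sqrt(eigvals)                     # proportional to 1 / std
    major = eigvecs[:, 1]                            # direction of highest threshold
    tilt = float(np.arctan2(major[1], major[0]) % np.pi)
    eccentricity = float(np.sqrt(1.0 - eigvals[0] / eigvals[1]))
    return precision, semi_axes, tilt, eccentricity

# Synthetic check: coordinate 0 varies twice as much as coordinate 1.
rng = np.random.default_rng(1)
demo = rng.normal(size=(20000, 3)) * np.array([2.0, 1.0, 0.5])
precision, semi_axes, tilt, ecc = plane_ellipse(demo, 0, 1)
```

In the synthetic check the long axis of the ellipse points along coordinate 1, the less variable direction, which corresponds to a higher predicted threshold and lower sensitivity there.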

Results in each coordinate plane are shown in Figure 3B. Across all subjects and all coordinate planes, we find that the shape and orientation of perceptual isodiscrimination contours (red ellipses) are predicted by the distribution of image statistics extracted from natural scenes (blue ellipses). As in Figure 3A, the correspondence is very good, with mismatch that is comparable to the variability observed across image analyses and across subjects.

To quantify the correspondence between natural image and psychophysical analyses, we describe each ellipse by a single vector $\vec{ω}$ that combines information about shape (eccentricity) and orientation (angular tilt), and we compute the scalar product between the image analysis vector ${\vec{ω}}_{NI}$ and the subject-averaged psychophysical vector ${\vec{ω}}_{PP}$. This value, averaged across coordinate planes, ranges from 0.953 to 0.977 across image analyses. We compared this correspondence to that obtained under the null hypotheses that (i) the apparent correspondence between image statistic covariances and isodiscrimination contours is due to chance, or (ii) the apparent covariances in image statistics are due to chance. The observed correspondence is much greater than the value measured under either null hypothesis (p ≤ 0.0003 for each image analysis under both hypotheses; see ‘Materials and methods’, Analysis of image statistics in pairwise coordinate planes, and Figure 3—figure supplement 2 for comparisons of eccentricity and tilt, and Tables 1–3 and ‘Materials and methods’, Permutation tests, for statistical tests).

These findings confirm that the shape and orientation of human isodiscrimination contours, measured across all pairwise combinations of coordinates, can be quantitatively predicted from the covariation of image statistics extracted from natural scenes. The observed correspondence is maintained within the full 9-dimensional coordinate space (see ‘Materials and methods’, Analysis of the full 9-dimensional distribution of image statistics, and Figure 3—figure supplement 3 for principal component analyses, and Tables 1–3 and ‘Materials and methods’, Permutation tests, for statistical tests), confirming that our hypothesis describes human sensitivity in the full 9-dimensional space of local image statistics extracted from natural scenes.

Discussion

How should neural mechanisms be distributed to represent a diverse set of informative sensory features? We argued that, when performance requires inferences limited by sampling of the statistics of input features, resources should be devoted in proportion to feature variability. A basic idea here is that features that take a wider range of possible values are less predictable, and will better distinguish between contexts in the face of input noise. We used this hypothesis to successfully predict human sensitivity to elements of visual form arising from spatial multi-point correlations in images. This result is notable for several reasons. First, we successfully predicted dozens of independent parameters that describe human perceptual sensitivity. The only free parameter was a scale that converted between perceptual sensitivities and natural image statistics. Moreover, predictions about the rank ordering of sensitivities (Figure 3A) and the shape and orientation of isodiscrimination contours (Figure 3B) do not even require a scale factor. Second, the theoretical predictions and their psychophysical test were derived from two very different sources. Psychophysical stimuli consisted of mathematically-defined synthetic binary textures with precisely-controlled correlational structure that is unlikely to occur outside of the laboratory. In contrast, the efficient coding predictions were derived from calibrated photographs of natural scenes in which many types of correlations are simultaneously present. Third, predictions refer to multi-point (and not just pairwise) correlations, which are critical for defining local features such as lines and edges (Oppenheim and Lim, 1981; Morrone and Burr, 1988). 
In contrast, previous normative theories have mainly focused on explaining the linear receptive fields of neurons in primary visual (Olshausen and Field, 1996; Bell and Sejnowski, 1997; van Hateren and Ruderman, 1998; van Hateren and van der Schaaf, 1998; Hyvarinen and Hoyer, 2000; Vinje and Gallant, 2000; Karklin and Lewicki, 2009) and auditory cortex (Carlson and DeWeese, 2002, 2012), or on deriving symmetry- and coverage-based mesoscopic models of cortical map formation in V1 (Wolf and Geisel, 1998; Swindale et al., 2000; Kaschube et al., 2011). Finally, the efficient coding prediction of greater sensitivity to more variable multipoint correlations is closely tied to the statistical structure of natural visual images. Specifically, this regime applies to highly variable multipoint correlations that cannot be predicted from simpler ones. Some other multipoint correlations (defined on configurations other than a 2 × 2 glider) are also highly variable, but they are predictable from simpler correlations. For these multipoint correlations, visual sensitivity is very low (Tkačik et al., 2010), and efficient coding is not applicable in the form proposed here.

In sum, the surprising predictive power and the high statistical significance of our results provide strong support for the proposed application of the efficient coding hypothesis to cortical processing of complex sensory features.

Perceptual salience of multi-point correlations likely arises in cortex

Although we did not record cortical responses directly, several lines of evidence indicate that the perceptual thresholds we measured are determined by cortical processes. First, the stimuli had high contrast (100%) and consisted of pixels that were readily visible (14 arcmin), so retinal limitations of contrast sensitivity and resolution were eliminated. Second, the task requires pooling of information over wide areas (100–200 pixels, that is, a region whose diameter is 10–15 times the width of an image element; see Figure 7 in Victor and Conte, 2005). Retinal receptive fields are unlikely to do this, as the ratio of their spatial extent (surround size) to their resolution (center size) is typically no more than 4:1 (Croner and Kaplan, 1995; Kremers et al., 1995). Third, to account for the specificity of sensitivity to three- and four-point correlations, a cascade of two linear-nonlinear stages is required (Victor and Conte, 1991); retinal responses are quite well-captured by a single nonlinear stage (Nirenberg and Pandarinath, 2012), and cat retinal populations show no sensitivity to the four-point correlations studied here (Victor, 1986), while simultaneous cortical field potentials do. Conversely, macaque visual cortical neurons (Purpura et al., 1994), especially those in V2, manifest responses to three- and four-point correlations (Yu et al., 2013).

Cortex faces a different class of challenges than the sensory periphery

Successive stages of sensory processing share the same broad goals: invest resources in encoding stimulus features that are sufficiently informative, and suppress less-informative ones. In the periphery, this is exemplified by the well-known suppression of very low spatial frequencies; in cortex, this is exemplified by insensitivity to high-order correlations that are predictable from lower-order ones. Previous work has shown that such higher-order correlations can be separated into two groups—informative and uninformative—and only the informative ones are encoded (Tkačik et al., 2010). We used this finding to select an informative subspace for the present study, and we asked how resources should be efficiently allocated amongst features within this informative subspace.

A simple model of efficient coding by neural populations is shown in Figure 4A (details in ‘Materials and methods’, Two regimes of efficient coding). Here, to enable analytical calculations, we used linear filters of variable gain, subject to Gaussian noise, to model a population of neural channels encoding different features. The optimal allocation of resources to maximize information transmitted by the population depends on the amount of input noise, the amount of output noise, the input signal variability, and the total resources available to the system, here quantified as a constraint on the total output power (i.e., sum of response variances) in the neural population. The constrained output power and the output noise together determine the ‘bandwidth’ of the system, that is, the expressive capacity of its outputs. Consider a neural population with input noise, output noise, and a fixed amount of output power. We find that when input signal variability is sufficiently large compared to the input noise, the gain of neurons should decrease with the variance of the input (regions to the right of the peaks in the right-hand panel of Figure 4A). This is a regime where the output bandwidth is low compared to the input range, and efficient coding predicts that signals should be ‘whitened’ by equalizing the variance in different channels. Conversely, consider input signals with a smaller range, which are thus more disrupted by input noise. In this case, the gain of neurons should increase with the variance of the input (regions to the left of the peaks in the right-hand panel of Figure 4A). This is a regime where the input noise dominates, and efficient coding predicts that the system should invest more resources in more variable, and hence more easily detectable, input signals. The relative sizes of input and output noise (controlled by $Λ$ in Figure 4A) determine the input ranges over which the two qualitatively different regimes of efficient coding apply.
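The two regimes described above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the parameterization used in ‘Materials and methods’: it assumes independent Gaussian channels, measures information in nats, constrains the summed variance of the filtered input, and uses a hypothetical Lagrange multiplier `lam` in place of the paper's $Λ$. The function name `optimal_gain` and all numerical values are illustrative only.

```python
import numpy as np

def optimal_gain(signal_var, noise_in_var, noise_out_var, lam):
    """Gain |L_k| maximizing Gaussian-channel information under a power penalty.

    Assumed objective (a textbook formulation, not taken from the paper):
        sum_k 0.5 * log(1 + g_k * S_k / (g_k * n + m)) - lam * sum_k g_k * (S_k + n)
    with g_k = |L_k|**2, S_k the signal variance, n the input noise variance,
    and m the output noise variance. Setting the derivative with respect to g_k
    to zero gives a quadratic in g_k; its positive root is used below.
    """
    S = np.asarray(signal_var, dtype=float)
    n, m = noise_in_var, noise_out_var
    root = np.sqrt(m**2 * S**2 + 2.0 * n * m * S / lam)
    g = (root - m * (S + 2.0 * n)) / (2.0 * n * (S + n))
    # Channels whose optimum would be negative receive no resources at all.
    return np.sqrt(np.clip(g, 0.0, None))

# Hypothetical settings: equal input and output noise, fixed multiplier.
signal_var = np.logspace(-2, 3, 200)
gain = optimal_gain(signal_var, noise_in_var=1.0, noise_out_var=1.0, lam=0.05)
peak = int(np.argmax(gain))
print(f"gain peaks at signal variance ~ {signal_var[peak]:.2f}")
```

With these settings the gain is zero for the weakest signals, rises with signal variance up to an interior peak, and falls beyond it, which mirrors the qualitative shape described for the right-hand panel of Figure 4A. Fixing `lam` rather than solving for the value that meets a given power budget keeps the sketch short; a fuller treatment would adjust `lam` until the constraint is satisfied.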


Regimes of efficient coding.

(A) To analyze different regimes of efficient coding, we consider a set of channels, where the $k^{\mathrm{th}}$ channel carries an input signal with variability $s_{k}$. Gaussian noise is added to the input. The result is passed through a linear filter with gain $| L_{k} |$, and then Gaussian noise is added to the filter output. We impose a constraint on the total power output of all channels, that is, a constraint on its total resources. With these assumptions, the set of gains that maximizes the transmitted information can be determined (see ‘Materials and methods’, Two regimes of efficient coding, and van Hateren, 1992a; Doi and Lewicki, 2011; Doi and Lewicki, 2014). This set of gains depends on the relative strengths of input and output noise and on the severity of the power constraint, quantified here by the dimensionless parameter $Λ$ (right-hand panel). As $Λ$ decreases from 1 to 0, the system moves from a regime in which output noise is limiting to one in which input noise is limiting. (B) The efficient coding model applied to the sensory periphery. Raw luminances from natural images are corrupted with noise (e.g., shot noise resulting from photon incidence) and passed through a linear filter. The resulting signal is carried by the optic nerve, which imposes a strong constraint on output capacity. In the bandwidth-limited case where output noise dominates over input noise (e.g., under high light conditions when photon noise is not limiting), the optimal gain decreases as signal variability increases. Since channel input and channel gain vary reciprocally, channel outputs are approximately equalized, resulting in a ‘whitening’, or decorrelation. (C) The efficient coding model applied to cortical processing. Informative image features resulting from early cortical processing, caricatured by our preprocessing pipeline as applied to the retinal output, are sampled from a spatial region of the image. This sampling acts as a kind of input noise, because it only provides limited count-based estimates for the true statistical properties of the image source. When this input noise is limiting, the optimal gain *increases* as signal variability increases. Rather than whiten, the output signals preserve the correlational structure of the input. Note that in both regimes (B) and (C), there is a range of signals that are not encoded at all. These are the signals that are not sufficiently informative to warrant an allocation of resources.

https://doi.org/10.7554/eLife.03722.018

To make these abstract considerations concrete, we first considered coding in the sensory periphery. A common strategy employed in the periphery is ‘whitening’, where relatively fewer resources are devoted (yielding lower gain) to features with more variation (Olshausen and Field, 1996). As an example, within the spatial frequency range that the retina captures well, sensitivity is greater for high spatial frequencies than for low ones, that is, sensitivity is inversely related to the degree of variation in natural scenes (the well-known $1 / f^{2}$ power spectrum [Olshausen and Field, 1996]). Figure 4B illustrates how this strategy can emerge from the simple efficient coding scheme discussed above as applied to peripheral sensory processing. Spatiotemporal correlations of light undergo filtering before passing through the optic nerve bottleneck (a constraint on bandwidth). Such a constraint on bandwidth is equivalently understood as a regime where output noise is relatively large compared to input noise. In this limit, where output noise dominates over input noise, the optimal strategy is whitening (see Srinivasan et al., 1982 and Figure 4A). Of course, real neural systems contend with both input and output noise; indeed, recent work has shown that simply whitening to deal with output noise underestimates the optimal performance that the sensory periphery can achieve (Doi and Lewicki, 2014).

An alternative regime arises when input noise limits performance. In this regime, relatively more resources are devoted to features with more variation. This regime was discussed in early work of van Hateren (1992a) and was also recognized by Doi and Lewicki (2011, 2014), although it has received much less attention than the ‘whitening’ regime. Our results suggest that this is the regime relevant to cortex, where it predicts the relative allocation of resources to higher-order image statistics. Figure 4C illustrates the simple efficient coding scheme in this context. We use our image preprocessing pipeline to mimic early visual processing, and we consider the downstream coding of higher-order image features. Because these features must be sampled from a finite patch of an image, they are subject to input noise arising from fluctuations in statistical estimation. When such input noise is limiting, the ability to detect a signal from noise increases with the variability of that signal. In this limit, efficient coding predicts that resources should be allocated in proportion to feature variability (Figure 4C). This captures the intuition that when signal reliability is in question, more reliable signals warrant more resources. Furthermore, if two or more channels have covarying signals, resources should be devoted in relation to the direction and degree of maximum covariance (see ‘Materials and methods’, Two regimes of efficient coding, Figure 4—figure supplement 3, and Figure 4—figure supplement 4).

The difference between these two efficient coding regimes is a consequence of the form of noise—output vs input noise—that is limiting. Our finding that cortex operates in a different regime than the well-known peripheral whitening reflects the fact that different stages and kinds of processing can face different constraints. While information transmission by the visual periphery is limited by a bottleneck in the optic nerve, cortex faces no such transmission constraint. Furthermore, while faithful encoding may be an immediate goal of early visual processing, cortical circuits have to interpret image features from a complex and crowded visual scene and perform statistical inference. For example, to discriminate between various textures, the cortex cannot perform pixel-by-pixel comparisons, but must rely on the estimation of local correlations (image statistics) instead. Because these correlations must be sampled from a finite patch of the visual scene, any estimate will be limited by sampling fluctuations.

Sampling constraints vs resource constraints

Sampling fluctuations constitute a source of input noise, the magnitude of which depends on the size of the sampled region. For natural images, this gives rise to a tradeoff: small regions lead to large fluctuations in the estimated statistics, while large regions blur over local details. This blurring may obscure the boundaries between objects with different surface properties. While the brain must implement such sampling, the size, scale, and potentially dynamic nature of the sampling region is not known. Interestingly, our predictions of human sensitivities do not change substantially over a wide range of spatial scales and image patch sizes, perhaps reflecting a scaling property of natural images (Stephens et al., 2013). An avenue for future research is to determine whether there is an optimal region size, and if so, whether it could be estimated from images themselves.
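The dependence of sampling noise on region size can be sketched with a toy simulation. This is an illustration under strong assumptions rather than the image analysis pipeline of the study: pixels are independent random binary values (no natural image structure), the statistic is a simple horizontal two-point correlation, and the helper names `horizontal_pair_statistic` and `estimate_spread` are hypothetical.

```python
import numpy as np

def horizontal_pair_statistic(patch):
    """Mean product of horizontally adjacent pixels after mapping {0, 1} to {-1, +1}."""
    signs = 2 * patch - 1
    return float(np.mean(signs[:, :-1] * signs[:, 1:]))

def estimate_spread(patch_size, n_patches, rng):
    """Standard deviation of the statistic across independently drawn square patches."""
    values = [
        horizontal_pair_statistic(rng.integers(0, 2, size=(patch_size, patch_size)))
        for _ in range(n_patches)
    ]
    return float(np.std(values))

rng = np.random.default_rng(0)
# The true value of the statistic is 0 for independent pixels, so the spread
# measured here is pure sampling fluctuation.
spreads = {R: estimate_spread(R, 500, rng) for R in (8, 16, 32, 64)}
for R, spread in spreads.items():
    print(f"patch {R:>2} x {R:<2}  spread of estimate = {spread:.4f}")
```

In this toy setting the spread shrinks roughly in inverse proportion to the patch width, because the number of pixel pairs grows with the patch area. The sketch captures only the first half of the tradeoff described above; the loss of local detail in large regions requires images with spatial structure.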

Sampling limitations alone do not suffice to account for the observed differential sensitivity of the brain to local image statistics. Were sampling limitations the only consideration, perceptual sensitivity would be the same along each coordinate axis, and perceptual isodiscrimination contours would be circular in each coordinate plane. This follows from an ideal observer calculation (see Appendix B of Victor and Conte, 2012). In contrast, we find that human observers have a severalfold variability in sensitivity along different coordinate axes (Figure 3A) and have isodiscrimination contours that are elongated in oblique directions (Figure 3B). The efficient coding principle can account for these findings by taking into consideration the fact that a real observer has finite processing resources. In this context (finite resources and substantial input noise), the efficient coding principle predicts that resources are invested in relation to the range of signal values that are typically present (van Hateren, 1992a), as we find. Interestingly, resource limitations seem to play an important role in the cortex despite the vast expansion in the number of neurons compared to the optic nerve. Presumably, this reflects the large number of complex features that could be computed and the corresponding need for a large overrepresentation of the stimulus space (Olshausen and Field, 1997).

Clues to neural mechanisms

While we find a close match between the variation in natural image statistics and human psychophysical performance, some aspects of the distribution of natural image statistics do not match psychophysical data.

These differences are not readily apparent when we examine the variances and covariances (Figure 3) of the distribution of natural image statistics but emerge only when one considers its detailed shape (see ‘Materials and methods’, Asymmetries in distributions of natural image statistics). For example, the distribution of α-coordinate values has a longer tail in the positive vs negative direction (see Figure 3—figure supplement 9 and Tkačik et al., 2010). In contrast, human perceptual sensitivity is symmetric, or very nearly so (within $\sim 20\%$), for positive vs negative values of α (Victor et al., 2005; Victor and Conte, 2012; Victor et al., 2013). This suggests that limitations imposed by ‘neural hardware’ force the system to use heuristics instead of matching the natural image distribution exactly. For example, an opponent mechanism responsible for detecting variations along, for example, the α coordinate, might be a useful and easy (although imperfect) way to process the asymmetric distribution of four-point correlations found in natural scenes. Such a mechanism could be matched to the variance of the natural image distribution along the α coordinate, but not to its skew or other odd moments. An opponent mechanism would necessarily give rise to equal sensitivities to positive vs negative values of α, as observed in psychophysical results. Further study of deviations from a perfect match to the distribution of natural image statistics might provide additional insight into these or other possible neural mechanisms, and into the goals of the computations. Independently, our results also raise an interesting theoretical question about the optimal representation of non-Gaussian, multidimensional signals under resource-limited conditions.
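The distinction between variance and odd moments can be made concrete with a short sketch. It uses a toy skewed distribution (a shifted gamma distribution) as a stand-in for an asymmetric coordinate distribution; it is not fitted to natural image data, and the helper name `variance_and_skewness` is hypothetical.

```python
import numpy as np

def variance_and_skewness(samples):
    """Return the variance and the standardized third moment of a 1-D sample."""
    x = np.asarray(samples, dtype=float)
    centered = x - x.mean()
    variance = np.mean(centered**2)
    skewness = np.mean(centered**3) / variance**1.5
    return float(variance), float(skewness)

rng = np.random.default_rng(1)
# Toy asymmetric distribution with a longer positive tail (assumption for illustration).
alpha_like = rng.gamma(shape=2.0, scale=1.0, size=100_000) - 2.0

var_pos, skew_pos = variance_and_skewness(alpha_like)
var_neg, skew_neg = variance_and_skewness(-alpha_like)  # mirror-image distribution

print(f"original: variance = {var_pos:.3f}, skewness = {skew_pos:+.3f}")
print(f"mirrored: variance = {var_neg:.3f}, skewness = {skew_neg:+.3f}")
```

The distribution and its mirror image share the same variance and differ only in the sign of the skewness. A mechanism whose sensitivity is set by the variance alone therefore cannot distinguish the two, and would respond equally to positive and negative deviations.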

Outlook

Looking forward, we hypothesize that the principle of efficient coding might apply to cortical processing at higher levels. For example, more complex image features, such as shapes, are represented as conjunctions of contour fragments (Brincat and Connor, 2004), where each contour fragment is a local image object defined by particular multi-point correlations. We might speculate that the joint statistics of contour fragments in natural scenes can predict, through appropriate formulation of the same efficient coding principle used here, the properties of neurons in area IT (Hung et al., 2012; Yau et al., 2012) or the associated perceptual sensitivities of human observers.

Finally, although we have focused on perception of image statistics, we do this with the premise that this process is in the service of inferring the materials and objects that created an image and ultimately, guiding action. Thus, it is notable that we found a tight correspondence between visual perception and natural scene statistics without considering a specific task or behavioral set. Indeed, the emergence of higher-order percepts without explicit task specification was the original hope of the efficient coding framework as first put forward by Barlow and Attneave (Attneave, 1954; Barlow, 1959, 1961). Doubtless, these ‘top-down’ factors also influence the visual computations that underlie perception, and the nature and site of this influence are an important focus of future research.

Measures of overlap	Image analysis: N	Image analysis: R	Observed overlap	Shuffled overlap: mean	Shuffled overlap: std	Shuffled overlap: min	Shuffled overlap: max	Significance
Range/Sensitivity ${\vec{σ}}_{NI} \cdot {\vec{s}}_{PP}$	N = 2	R = 32	0.999	0.859	0.9 × 10⁻¹	0.704	0.983	<0.04
		R = 48	0.993	0.832	1.1 × 10⁻¹	0.651	0.978	<0.04
		R = 64	0.987	0.809	1.1 × 10⁻¹	0.614	0.974	<0.04
	N = 4	R = 32	0.998	0.825	1.1 × 10⁻¹	0.638	0.969	<0.04
		R = 48	0.994	0.812	1.1 × 10⁻¹	0.646	0.990	<0.04
		R = 64	0.991	0.794	1.1 × 10⁻¹	0.617	0.985	<0.04
Inverse Range/Threshold $〈 {\vec{ω}}_{NI} \cdot {\vec{ω}}_{PP} 〉$	N = 2	R = 32	0.971	0.709	1.5 × 10⁻¹	0.508	0.924	<0.04
		R = 48	0.969	0.692	1.6 × 10⁻¹	0.469	0.924	<0.04
		R = 64	0.953	0.685	1.7 × 10⁻¹	0.450	0.913	<0.04
	N = 4	R = 32	0.967	0.679	1.7 × 10⁻¹	0.447	0.908	<0.04
		R = 48	0.975	0.632	1.5 × 10⁻¹	0.400	0.880	<0.04
		R = 64	0.977	0.648	1.6 × 10⁻¹	0.411	0.894	<0.04
Fractional Principal Components ${\vec{f}}_{NI} \cdot {\vec{f}}_{PP}$	N = 2	R = 32	0.994	0.382	1.5 × 10⁻¹	0.160	0.657	<0.04
		R = 48	0.995	0.485	1.2 × 10⁻¹	0.287	0.727	<0.04
		R = 64	0.991	0.487	0.7 × 10⁻¹	0.372	0.632	<0.04
	N = 4	R = 32	0.995	0.459	1.4 × 10⁻¹	0.238	0.732	<0.04
		R = 48	0.996	0.444	1.0 × 10⁻¹	0.277	0.601	<0.04
		R = 64	0.996	0.450	1.1 × 10⁻¹	0.279	0.614	<0.04
Full Principal Components $〈 {\vec{F}}_{NI} \cdot {\vec{F}}_{PP} 〉$	N = 2	R = 32	0.917	0.316	1.3 × 10⁻¹	0.123	0.578	<0.04
		R = 48	0.828	0.401	1.0 × 10⁻¹	0.228	0.611	<0.04
		R = 64	0.911	0.363	0.7 × 10⁻¹	0.282	0.532	<0.04
	N = 4	R = 32	0.882	0.376	1.2 × 10⁻¹	0.180	0.618	<0.04
		R = 48	0.917	0.362	1.0 × 10⁻¹	0.201	0.520	<0.04
		R = 64	0.919	0.357	1.0 × 10⁻¹	0.196	0.522	<0.04

Measures of overlap	Image analysis: N	Image analysis: R	Observed overlap	Shuffled overlap: mean	Shuffled overlap: std	Shuffled overlap: min	Shuffled overlap: max	Significance
Range/Sensitivity ${\vec{σ}}_{NI} \cdot {\vec{s}}_{PP}$	N = 2	R = 32	0.999	0.806	6.8 × 10⁻²	0.659	0.999	0.0003
		R = 48	0.993	0.775	7.7 × 10⁻²	0.610	0.993	<0.0001
		R = 64	0.987	0.762	8.0 × 10⁻²	0.579	0.987	<0.0001
	N = 4	R = 32	0.998	0.828	6.0 × 10⁻²	0.707	0.998	<0.0001
		R = 48	0.994	0.798	7.1 × 10⁻²	0.660	0.994	0.0002
		R = 64	0.991	0.780	7.6 × 10⁻²	0.630	0.991	<0.0001
Inverse Range/Threshold $〈 {\vec{ω}}_{NI} \cdot {\vec{ω}}_{PP} 〉$	N = 2	R = 32	0.971	0.693	8.1 × 10⁻²	0.499	0.972	0.0002
		R = 48	0.969	0.682	8.4 × 10⁻²	0.476	0.969	0.0003
		R = 64	0.953	0.671	8.5 × 10⁻²	0.446	0.954	0.0002
	N = 4	R = 32	0.967	0.696	7.6 × 10⁻²	0.521	0.964	<0.0001
		R = 48	0.975	0.692	8.0 × 10⁻²	0.509	0.976	0.0002
		R = 64	0.977	0.689	8.2 × 10⁻²	0.493	0.978	0.0003
Fractional Principal Components $〈 {\vec{f}}_{NI} \cdot {\vec{f}}_{PP} 〉$	N = 2	R = 32	0.994	0.592	1.2 × 10⁻¹	0.271	0.995	0.0003
		R = 48	0.995	0.604	1.3 × 10⁻¹	0.281	0.995	0.0004
		R = 64	0.991	0.591	1.2 × 10⁻¹	0.278	0.991	0.0003
	N = 4	R = 32	0.995	0.590	1.2 × 10⁻¹	0.218	0.995	0.0001
		R = 48	0.996	0.577	1.2 × 10⁻¹	0.251	0.996	0.0002
		R = 64	0.996	0.581	1.2 × 10⁻¹	0.266	0.996	0.0004
Full Principal Components $〈 {\vec{F}}_{NI} \cdot {\vec{F}}_{PP} 〉$	N = 2	R = 32	0.917	0.391	1.2 × 10⁻¹	0.100	0.927	0.0002
		R = 48	0.828	0.391	1.2 × 10⁻¹	0.086	0.856	0.0008
		R = 64	0.911	0.396	1.2 × 10⁻¹	0.120	0.953	0.0003
	N = 4	R = 32	0.882	0.381	1.2 × 10⁻¹	0.066	0.989	0.0003
		R = 48	0.917	0.380	1.2 × 10⁻¹	0.090	0.902	<0.0001
		R = 64	0.919	0.387	1.2 × 10⁻¹	0.095	0.937	0.0004

Comparisons	Image analysis: N	Image analysis: R	Observed overlap	Shuffled overlap: mean	Shuffled overlap: std	Shuffled overlap: min	Shuffled overlap: max	Significance
Inverse Range/Threshold $〈 {\vec{ω}}_{NI} \cdot {\vec{ω}}_{PP} 〉$	N = 2	R = 32	0.971	0.924	0.70 × 10⁻³	0.921	0.926	<0.0001
		R = 48	0.969	0.921	1.1 × 10⁻³	0.917	0.925	<0.0001
		R = 64	0.953	0.912	1.3 × 10⁻³	0.908	0.917	<0.0001
	N = 4	R = 32	0.967	0.919	1.7 × 10⁻³	0.914	0.926	<0.0001
		R = 48	0.975	0.922	1.9 × 10⁻³	0.916	0.930	<0.0001
		R = 64	0.977	0.924	2.8 × 10⁻³	0.916	0.935	<0.0001
Fractional Principal Components $〈 {\vec{f}}_{NI} \cdot {\vec{f}}_{PP} 〉$	N = 2	R = 32	0.994	0.806	9.1 × 10⁻⁶	0.806	0.806	<0.0001
		R = 48	0.995	0.806	8.3 × 10⁻⁶	0.806	0.806	<0.0001
		R = 64	0.991	0.806	3.7 × 10⁻⁶	0.806	0.806	<0.0001
	N = 4	R = 32	0.995	0.807	2.5 × 10⁻⁴	0.806	0.809	<0.0001
		R = 48	0.996	0.807	4.1 × 10⁻⁴	0.806	0.810	<0.0001
		R = 64	0.996	0.807	3.5 × 10⁻⁴	0.806	0.810	<0.0001
Full Principal Components $〈 {\vec{F}}_{NI} \cdot {\vec{F}}_{PP} 〉$	N = 2	R = 32	0.917	0.448	5.8 × 10⁻²	0.406	0.596	<0.0001
		R = 48	0.828	0.502	5.9 × 10⁻²	0.408	0.675	<0.0001
		R = 64	0.911	0.458	4.8 × 10⁻²	0.407	0.591	<0.0001
	N = 4	R = 32	0.881	0.489	4.9 × 10⁻²	0.409	0.638	<0.0001
		R = 48	0.917	0.454	3.0 × 10⁻²	0.408	0.637	<0.0001
		R = 64	0.919	0.492	4.2 × 10⁻²	0.411	0.648	<0.0001
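The tables above compare an observed overlap with the distribution of overlaps obtained under shuffling. A generic version of such a comparison can be sketched as follows. This is an assumption-laden illustration: the overlap is taken to be the normalized dot product of two vectors, the null distribution is built by randomly permuting the entries of one vector, and the input vectors are made-up numbers. The specific shuffling schemes behind each table are not restated here and may differ.

```python
import numpy as np

def overlap(u, v):
    """Normalized dot product (cosine similarity) between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def shuffle_test(predicted, measured, n_shuffles, rng):
    """Compare the observed overlap with overlaps after permuting `predicted`."""
    observed = overlap(predicted, measured)
    shuffled = np.array(
        [overlap(rng.permutation(predicted), measured) for _ in range(n_shuffles)]
    )
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (np.sum(shuffled >= observed) + 1) / (n_shuffles + 1)
    return observed, shuffled, float(p_value)

# Hypothetical vectors, for illustration only (not data from the study).
predicted = [0.90, 0.70, 0.60, 0.50, 0.45, 0.30, 0.25, 0.20, 0.15, 0.10]
measured = [0.85, 0.75, 0.55, 0.50, 0.40, 0.35, 0.20, 0.25, 0.10, 0.15]

rng = np.random.default_rng(2)
observed, shuffled, p_value = shuffle_test(predicted, measured, 2000, rng)
print(f"observed = {observed:.3f}")
print(f"shuffled: mean = {shuffled.mean():.3f}, std = {shuffled.std():.3f}, "
      f"min = {shuffled.min():.3f}, max = {shuffled.max():.3f}")
print(f"p = {p_value:.4f}")
```

The printed summary has the same columns as the tables (observed overlap, then the mean, std, min, and max of the shuffled overlaps, then a significance estimate), which makes it straightforward to see how a high observed overlap can still be uninformative if the shuffled overlaps are also high.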


Author details

Ann M Hermundstad (for correspondence)
John J Briguglio
Mary M Conte
Jonathan D Victor (contributed equally)
Vijay Balasubramanian (contributed equally)
Gašper Tkačik (contributed equally)
