Introduction

Vision science is often concerned with what things look like (appearance), but a long and fruitful thread of research has investigated what humans cannot see, that is, the information they are insensitive to. Perceptual metamers — images that are physically distinct but perceptually indistinguishable — provide direct evidence of such information loss in visual representations. Cohen and Kappauf (1985) identify this concept in the writings of Isaac Newton, who noted that particular colors of light could be created by mixing other colors together, and trace the word “metamer” to a 1919 chapter by Wilhelm Ostwald, the 1909 Nobel laureate in chemistry. Color metamers were instrumental in the development of the Young-Helmholtz theory of trichromacy (Helmholtz, 1852). In this context, metamers clarified human sensitivity to light wavelengths, demonstrating that the human visual system projects the infinite dimensionality of the physical signal to three dimensions. It took more than a century before the physiological basis for this — the three classes of cone photoreceptor — was revealed experimentally (Schnapf et al., 1987).

The visual system also discards a great deal of spatial detail, more so in portions of the visual field farthest from the center of gaze. Specifically, the reduction of visual capabilities with increasing eccentricity has been demonstrated for both acuity (Frisen and Glansholm, 1975) and contrast sensitivity (Banks et al., 1987; Robson and Graham, 1981; Rovamo et al., 1978), and is reflected in the physiology: fewer cortical resources are dedicated to the periphery (Schwartz, 1977) and receptive fields in all stages of the visual hierarchy grow with eccentricity (e.g., Daniel and Whitteridge (1961); Dacey and Petersen (1992); Gattass et al. (1981, 1988); Maunsell and Newsome (1987); Wandell and Winawer (2015)). This decrease in acuity has been demonstrated either by scaling the size of features, as in the Anstis eye chart (Anstis, 1974), or by progressively blurring the image, as in Anstis (1998); Thibos (2020). More generally, one can explain this decreasing sensitivity to spatial information with “pooling models”, which compute local averages of image features in windows that grow larger with eccentricity (Balas et al., 2009; Freeman and Simoncelli, 2011; Keshvari and Rosenholtz, 2016). These models assume that peripheral representations are qualitatively similar to those of the fovea: the same local computations are performed over larger regions. Here, we test two such models — one that averages local luminance (luminance model) and one that averages both local spectral energy and luminance (energy model). We generate image pairs in which one or both images have been manipulated such that the two are model metamers (images that are physically distinct but with identical model representations). Such a pair of model metamers is also a pair of perceptual metamers if the human visual system is insensitive to the differences between the two images, as schematized in figure 1 (see Watson et al. (1986) for an analogous presentation with respect to sensitivity to spatial and temporal frequency). By comparing model and human perceptual metamers, we investigate how well the models’ sensitivities (and insensitivities) align with those of the human visual system.

Schematic diagram of perceptual metamers. Each panel shows a two-dimensional depiction of the set of all possible images: every image corresponds to a point in this space and every point in this space represents an image. Perceptual metamers — images that cannot be reliably distinguished — form equivalence classes in this space, which we illustrate by partitioning the space into distinct regions, each containing images that appear identical. Left: example image (black point), and surrounding metameric images (region enclosed by black polygon). Center: In a hierarchical visual system, in which each stage transforms the signals of the previous stage and discards some information, metamers of earlier stages are nested within metamers of later stages. That is, every pair of images that are metamers for an early visual area 𝒩1 are also metamers for the later visual area 𝒩2. The converse does not hold: there are images that 𝒩1 can distinguish but that 𝒩2 cannot. Right: Two particular image families: samples of white noise (distribution represented as a grayscale intensity map in the lower right corner) and the set (manifold) of natural images (distribution represented by the curved gray line). Typical white noise samples fall within a single perceptual metamer class (humans are unable to distinguish them). Natural images, on the other hand, are generally distinguishable from each other, as well as from un-natural images (those that lie off the manifold).

This procedure rests on the assumption that the visual system processes information hierarchically, in a sequence of stages, and that information discarded in early stages cannot be recovered by later stages (the “data-processing inequality”)1. For example, metameric color stimuli produce identical cone responses, and thus cannot be distinguished by any downstream neural processing. Perceptually, if two images generate identical responses in all neurons at any stage of processing preceding that of perceptual decision making (e.g., retinal ganglion cells or primary visual cortical neurons), they will appear identical. This is schematized in the central panel of figure 1: two images are perceptual metamers if they are indistinguishable to an early visual area 𝒩1, or if this early stage is sensitive to their differences but those differences are discarded by later stages in the hierarchy. A number of authors have created perceptual metamers by matching complex statistics thought to be represented at a high level of the visual processing hierarchy (Freeman and Simoncelli, 2011; Keshvari and Rosenholtz, 2016; Wallis et al., 2019; Jagadeesh and Gardner, 2022; Feather et al., 2019), as well as statistics at the level of the photoreceptors. However, there are fewer examples of perceptual metamers that match simpler image statistics.

Here, we synthesize metamers for two different models and measure their perceptual discriminability. The two models are based on local pooling of luminance and spectral energy, respectively, which can be loosely associated with two stages of visual physiology. Specifically, retinal ganglion cells encode local light level, and V1 cells encode local spectral energy. For each set of natural images, we used a stochastic gradient descent method to generate model metamers for both models, and measured discrimination capabilities of human observers when comparing the metamers with their corresponding original images, as well as with each other. We also examined the influence of the initial image used for the metamer synthesis algorithm. The two types of models and multiple types of comparisons shed light on and raise new questions about what makes images distinguishable.

Results

Foveated pooling models

We constructed foveated models of human perception that capture sensitivity to local luminance and spectral energy (see figure 3). Both models are “pooling models” (Balas et al., 2009; Freeman and Simoncelli, 2011; Keshvari and Rosenholtz, 2016; Wallis et al., 2019), which compute statistics as weighted averages over local windows. A specific pooling model is characterized by both the statistics that are pooled and the shapes/sizes of the pooling regions. In the human visual system, receptive field sizes grow proportionally with distance from the fovea, as has been documented in monkey physiology and human fMRI (e.g., Gattass et al. (1981); Wandell and Winawer (2015)). We reduce this to a single scaling parameter by assuming smooth, overlapping pooling regions that are separable and of constant size when expressed in polar angle and log-eccentricity (“log-polar”, corresponding to the approximate log-polar geometry of visual cortical maps (Schwartz, 1977)). The value of this parameter, along with the choice of statistics, determines the sets of model metamers; for a given set of statistics, increasing the scaling value will increase the size of the sets of metameric images, in a nested manner (figure 2).

Left: Pooling models are parameterized by their scaling value, s, and the statistics they pool, θ. Like any system that discards information, including the human visual system, these models have metamers: sets of images that are perceived as identical, represented graphically as enclosed non-overlapping regions. We draw samples from these sets using an optimization procedure: starting with an initial image Ii, we adjust the pixel values until their pooled statistics match those of the target (original) image T. The synthetic image depends on the target image, the metamer model, and also on the initial point and the stochastic synthesis algorithm. For a given set of statistics θ, increasing the scaling value will increase the size of the metamer set in a nested manner: any image that is a metamer for s,θ will also be a metamer for αs,θ, for any factor α > 1. Right: Changing the set of pooled statistics from θ to ϕ will result in different sets of model metamers, which may or may not overlap with those of the original model (though both must include the target image, T). If the model’s metamer classes differ substantially in shape from the perceptual metamer classes, they will not provide a good description of the perceptual metamers at critical scaling (e.g., the blue ellipse contains only a small portion of the surrounding perceptual metamer region).

Two pooling models. Both models compute local image statistics weighted by a Gaussian that is separable in a log-polar coordinate system, such that radial extent is approximately twice the angular extent (half-maximum levels indicated by red contours). Windows are laid out in a log-polar grid, with peaks separated by one standard deviation. A single scaling factor governs the size of all pooling windows. The luminance model (top) computes average luminance, approximating the spatial pooling performed by retinal ganglion cells. The spectral energy model (bottom) computes average spectral energy at 4 orientations and 6 scales, as well as luminance, for a total of 25 statistics per window, approximating the representation of complex cells in primary visual cortex (V1). Spectral energy is computed using the complex steerable pyramid constructed in the Fourier domain (Simoncelli and Freeman, 1995), squaring and summing across the real and imaginary components. A full resolution version of this figure can be found on the OSF.

We implemented log-polar pooling windows, with size proportional to their distance from the fovea. Like previous studies (Freeman and Simoncelli, 2011; Wallis et al., 2019), these pooling windows are overlapping and radially-elongated (the radial extent is roughly twice the angular extent), but unlike previous studies, we use Gaussian profiles with more extensive overlap, yielding a smoother representation and higher-quality synthesized images. The proportional overlap between adjacent windows is chosen to alleviate ringing and blocking artifacts in the synthesis, and is held fixed at all scaling values.
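To make this construction concrete, the sketch below builds a single log-polar Gaussian pooling window on a pixel grid. It is a minimal illustration rather than our released implementation: the field-of-view, the normalization, and the hard-coded 2:1 radial-to-angular aspect ratio are simplifying assumptions, and the full model tiles the image with many such windows. The final lines show the luminance model's pooled statistic for one window: a weighted average of pixel values.

```python
import numpy as np

def pooling_window(shape, center_ecc, center_angle, scaling, max_ecc=26.8):
    """A single Gaussian pooling window, separable in log-eccentricity
    and polar angle, evaluated on an image grid.

    shape        -- (height, width) of the image, in pixels
    center_ecc   -- eccentricity of the window's peak, in degrees
    center_angle -- polar angle of the window's peak, in radians
    scaling      -- radial FWHM of the window divided by its eccentricity
    max_ecc      -- eccentricity at the horizontal image edge, in degrees
    """
    h, w = shape
    y, x = np.mgrid[:h, :w]
    # convert pixel coordinates to degrees, with the fovea at the image center
    x = (x - w / 2) * 2 * max_ecc / w
    y = (y - h / 2) * 2 * max_ecc / w
    ecc = np.maximum(np.hypot(x, y), 1e-6)   # avoid log(0) at the fovea
    angle = np.arctan2(y, x)
    # a radial FWHM proportional to eccentricity is a constant FWHM of
    # `scaling` in log-eccentricity; FWHM = 2*sqrt(2 ln 2) * std
    std_rad = scaling / (2 * np.sqrt(2 * np.log(2)))
    std_ang = std_rad / 2                    # radial extent ~2x angular extent
    d_ecc = np.log(ecc) - np.log(center_ecc)
    d_ang = np.mod(angle - center_angle + np.pi, 2 * np.pi) - np.pi
    win = np.exp(-d_ecc**2 / (2 * std_rad**2) - d_ang**2 / (2 * std_ang**2))
    return win / win.sum()                   # normalize for weighted averaging

# pooled statistic of the luminance model: a local average of pixel values
img = np.random.rand(512, 512)
win = pooling_window(img.shape, center_ecc=10.0, center_angle=0.0, scaling=0.1)
mean_luminance = (win * img).sum()
```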

In the current study, we examine two models that compute different statistics within their pooling windows. The luminance model pools pixel intensities, and thus a pair of luminance model metamers have the same average luminance within each pooling window. Since these pooled responses are insensitive to the highest spatial frequencies, luminance model metamers include blurred versions of the target image (in which high frequencies are discarded), but also variants of the target image in which high frequencies are randomized or even amplified. In general, synthesized luminance model metamers will inherit the high frequency information of their initialization image, as can be seen in figure 4. The middle row of figure 4 shows model metamers computed with two example scaling values. While the high-scaling model metamer is clearly perceptually distinct from the target image (regardless of observer fixation location), the low-scaling image is not easily discriminated from the target when fixating at the center of the image (i.e., when the human fovea is aligned with the model fovea).

Example synthesized model metamers. Top: Target image. Middle: Luminance model metamers, computed for two different scaling values (values as indicated, red ellipses to right of fixation indicate pooling window contours at half-max at that eccentricity). The left image is computed with a small scaling value, and is a perceptual metamer for most subjects: when fixating at the cross in the center of the image, it appears identical to the target image. Note, however, that when fixating in the periphery (e.g., the blue box), one can clearly see that the image differs from the target (see enlarged versions of the foveal and peripheral neighborhoods to right). The right image is computed with a larger scaling value, and is no longer a perceptual metamer (for any choice of observer fixation). Bottom: Energy model metamers. Again, the left image is computed with a small scaling value and is a perceptual metamer for most observers when fixated on the center cross. Peripheral content (e.g., blue box) contains more complex distortions, readily visible when viewed directly. The right image, computed with a large scaling value, differs in appearance from the target regardless of observer fixation. Full resolution version of this figure can be found on the OSF.

The spectral energy model pools the squared outputs of oriented bandpass filter responses at multiple scales and orientations. It also pools the pixel intensities. The energies are computed using a complex steerable pyramid, which decomposes images into frequency channels selective for 6 different scales and 4 different orientations. Energy is computed by squaring and summing across the real and imaginary responses (arising from even- and odd-symmetric filters) within each channel. These energies, along with the luminances, are then averaged within the spatial pooling windows. Thus, a pair of energy model metamers have the same average oriented energy and luminance within each of these windows. The bottom row of figure 4 shows model metamers for two different scaling values. The low scaling value for the energy model is approximately matched to the higher scaling value for the luminance model, while the higher scaling value is approximately that associated with V1 receptive fields (Freeman and Simoncelli, 2011). The high-scaling model metamer is perceptually distinct from the target image, and also perceptually distinct from the high-scaling luminance model metamer. The low-scaling model metamer, on the other hand, is difficult to distinguish from the original image (when fixating at the center), but is readily distinguished when one fixates peripherally.
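The sketch below illustrates this computation with idealized frequency-domain filters. It is a simplified stand-in for the complex steerable pyramid (the raised-cosine radial bands, cos² angular tuning, and center frequencies here are placeholder choices of ours), but it shows the essential step: complex-valued oriented bandpass responses whose squared magnitudes give local spectral energy, which would then be averaged within the same pooling windows used by the luminance model.

```python
import numpy as np

def oriented_energy(img, n_scales=6, n_orientations=4):
    """Local oriented spectral energy from idealized bandpass filters.

    Each channel is a one-sided (analytic) filter in the Fourier domain,
    so the squared magnitude of its complex response is equivalent to
    summing squared even- and odd-phase (quadrature) filter outputs.
    """
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.hypot(fx, fy)
    angle = np.arctan2(fy, fx)
    F = np.fft.fft2(img)
    energy = {}
    for s in range(n_scales):
        ctr = 0.25 / 2**s                       # center frequency, cycles/pixel
        with np.errstate(divide="ignore"):
            # raised-cosine radial band, one octave wide, centered at ctr
            rad = np.cos(np.clip(np.log2(radius / ctr), -1, 1) * np.pi / 2)
        for o in range(n_orientations):
            theta = o * np.pi / n_orientations
            d = np.mod(angle - theta + np.pi, 2 * np.pi) - np.pi
            # cos^2 angular tuning, nonzero on one side of the spectrum only
            ang = np.where(np.abs(d) < np.pi / 2, np.cos(d) ** 2, 0.0)
            band = np.fft.ifft2(F * rad * ang)  # complex-valued response
            energy[(s, o)] = np.abs(band) ** 2  # even^2 + odd^2
    return energy
```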

The appearance of these two high-scaling metamers reflects the measurements that the models are matching and the seed images used for synthesis. The luminance model matches average pixel intensity, but has no constraints on spatial frequency, and thus its metamers retain the high frequencies present in the initial white-noise images. The energy model, on the other hand, matches the average contrast energy at all scales and orientations, but discards exact position information (which depends on phase structure). Hence, unlike the luminance model metamers, it reduces the high frequency power to match the typical content of natural images. Instead, it essentially scrambles the phase spectrum, leading to the cloud-like appearance of its metamers.

As can be seen in figure 4, both models can generate perceptual metamers. More generally, all pooling models can generate perceptual metamers if the scaling value is made sufficiently small (in the limit as scaling goes to zero, the model metamers must be identical to the target, in every pixel). For statistics that capture the relevant features for responses at some stage of the visual system, metamers can be achieved with windows whose size is matched to that of the underlying visual neurons. The maximal scaling at which synthetic images are perceptual metamers is thus highly dependent on the choice of underlying statistics: in our examples, the energy model perceptual metamer (figure 4, bottom left) is generated with a scaling value about six times larger than that for the luminance model perceptual metamer (middle left), and about five times smaller than those found in Freeman and Simoncelli (2011) using texture statistics. The goal of the present study is to use psychophysics to find the largest scaling value for which these two models generate perceptual metamers, known as the critical scaling.

Psychophysical experiment

We synthesized model metamers matching 20 different natural images (the target images) collected from the authors’ personal collections, as well as from the UPenn Natural Image Database (Tkačik et al., 2011), and an unpublished collection by David Brainard. The images were chosen to span a variety of natural image content types, including buildings, animals, and natural textures (figure 5). Model metamers were generated via gradient descent on the squared error between target and synthetic pooled statistics, and initialized with either an image of white noise or another image drawn from the set of target images.

Target images used in the experiments. Images contain a variety of content, including textures, objects, and scenes. All are relatively high-resolution RAW camera images, with values proportional to luminance and quantized to 16 bits. Images were converted to grayscale, cropped to 2048 x 2600 pixels, displayed at 53.6 x 42.2 degrees, with intensity values rescaled to lie within the range of [0.05, 0.95] of the display intensities. All subjects saw target images 1-10, half saw 11-15, and half saw 16-20. A full resolution version of this figure can be found on the OSF.

In the experiments, observers discriminated two grayscale images, of size 53.6 by 42.2 degrees, sequentially displayed. Each image was presented for 200 ms, separated by a 500 ms interval in which the screen was blank (mid-gray). Each image was separated into two halves by a superimposed vertical bar (mid-gray, 2 deg wide, see figure 15). One side, selected at random on each trial, was identical in the two images, while the other differed (e.g., one interval contains the target image, the other a synthesized model metamer). After the second image, a mid-gray screen appeared with text prompting the observer to report which side of the image had changed.

Critical scaling is four times smaller for the luminance than the energy model

We fit the behavioral data using the 2-parameter function introduced in Freeman and Simoncelli (2011), estimating the critical scaling (sc) and maximum d′ (α) parameters with a Markov Chain Monte Carlo procedure and a hierarchical, partial-pooling model similar to that used by Wallis et al. (2019).
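For reference, a minimal sketch of this curve is below: d′ is zero below the critical scaling sc and saturates at α above it. The functional form follows Freeman and Simoncelli (2011); the mapping from d′ to proportion correct shown here (the standard unbiased two-alternative link, P = Φ(d′/√2)) is an illustrative assumption rather than a transcription of our fitting code, which additionally handles the hierarchy across subjects and images.

```python
import numpy as np
from scipy.stats import norm

def d_prime(scaling, s_c, alpha):
    """Two-parameter discriminability curve: chance below the critical
    scaling s_c, rising toward the asymptote alpha above it."""
    scaling = np.asarray(scaling, dtype=float)
    return alpha * np.sqrt(np.clip(1 - (s_c / scaling) ** 2, 0, None))

def prob_correct(scaling, s_c, alpha):
    # assumed 2AFC link between d' and proportion correct
    return norm.cdf(d_prime(scaling, s_c, alpha) / np.sqrt(2))

# e.g., a luminance-model-like curve for the original vs. synthesized task
print(prob_correct([0.01, 0.016, 0.03, 0.25], s_c=0.016, alpha=3.5))
```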

For a given model and comparison, performance increases monotonically with scaling, and is fit well by this particular psychometric function (figure 6A). The exception is the synthesized vs. synthesized comparison for the luminance model, where performance remains poor at all scales (see next section). In the original vs. synthesized cases (for both models), performance is near chance for the smallest scaling values tested and exceeds 90% for the largest. The critical scaling values, as seen in figure 6B, are approximately 0.016 for the luminance model and 0.06 for the energy model. For comparison, we show the approximate scaling values for the receptive field diameters of retinal ganglion cells (both Midget and Parasol), as well as for V1 cells (see appendix 2 for details on retinal ganglion cell scaling, and Freeman and Simoncelli (2011) for V1 scaling). The critical scaling for the luminance model falls between the two types of retinal ganglion cell. The critical scaling for the energy model is larger than that of both ganglion cell types, and approximately half of the lower end of the V1 range.

Performance and psychophysical curve parameter values for different models and image comparisons. The luminance model has a substantially smaller critical scaling than the energy model, and original vs. synthesized comparisons yield smaller critical scaling values than synthesized vs. synthesized comparisons. (A) Psychometric functions, expressing probability correct as a function of scaling parameter, for both energy and luminance models (aqua and beige, respectively), and original vs. synthesized (solid line) and synthesized vs. synthesized (dashed line) comparisons. Data points represent average values across subjects and images, 4320 trials per data point, except for the luminance model synthesized vs. synthesized comparison, which has only 180 trials per data point (one subject, five images). Lines represent the posterior predictive means of fitted curves across subjects and images, with the shaded region indicating the 95% high-density interval (HDI, Kruschke (2015)). Horizontal bars (below dashed line at 0.5) indicate the range of physiological scaling values for the associated retinal ganglion cell type or cortical area. (B) Estimated parameter values, separated by image (left) or subject (right). Top row shows the critical scaling value and the bottom the value of the maximum d′ parameter. Points represent the posterior means, shaded regions the 95% HDI, and horizontal dashed lines and shaded regions the global means and 95% HDI. Note that the luminance model, synthesized vs. synthesized comparison is not shown, because the data are poorly fit (panel A, beige dashed line).

Critical scaling is lower for original vs. synthesized than synthesized vs. synthesized comparisons

For both luminance and energy models, it is generally easier to distinguish an original image from a synthesized image than to distinguish two synthesized images initialized with different white noise seeds (with the same target image and scaling value). For the luminance model, discrimination of two synthesized images is nearly impossible at all scaling values. For the energy model, discriminating two synthesized images is possible but difficult, with performance only approaching 60%, on average (although note that there are substantial differences across subjects, see figure 6B and appendix 6). The critical scaling value for this comparison, 0.25, is close to the physiological scaling value estimated for V1, and comparable to that reported in Freeman and Simoncelli (2011). The asymptotic performance, however, is much lower in our data. We attribute this to experimental differences (see appendix section 4).

The difficulty of differentiating between two synthesized images is striking, as illustrated in figure 7. In the limit of global pooling windows, luminance model metamers are global samples of white noise, which cannot be distinguished from each other (Wallis et al. (2019) made a similar point when discussing their use of the original vs. synthesized task). Analogously, synthesis with the energy model forces local oriented spectral energy to match, without explicitly constraining the phase. Two instances of phase scrambling within peripheral windows are not easily discriminable, even though either of the two might be discriminable from an image with more structure.
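This intuition is easy to demonstrate directly. The sketch below scrambles the Fourier phases of an image globally (the whole-image analogue of the local scrambling the energy model permits within each pooling window, so a simplification of what our synthesis actually does): two scrambles from different seeds share an amplitude spectrum and typically look like samples of the same cloudy texture, even though each differs markedly from the original.

```python
import numpy as np

def phase_scramble(img, seed=None):
    """Randomize an image's Fourier phases while preserving its amplitude
    spectrum, by borrowing the phases of a white noise image (which keeps
    the conjugate symmetry needed for a real-valued result)."""
    rng = np.random.default_rng(seed)
    noise_spectrum = np.fft.fft2(rng.standard_normal(img.shape))
    random_phase = noise_spectrum / np.abs(noise_spectrum)
    return np.fft.ifft2(np.abs(np.fft.fft2(img)) * random_phase).real

img = np.random.rand(256, 256)
a, b = phase_scramble(img, seed=0), phase_scramble(img, seed=1)
# a and b are physically distinct but share identical amplitude spectra
```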

Comparison of two synthesized metamers is more difficult than comparison of a synthesized metamer with the original image. For the highest tested scaling value (1.5) the original vs. synthesized comparison is trivial while the synthesized vs. synthesized comparison is difficult (energy model) or impossible (luminance model). Top: target image. Middle: Two luminance model metamers, generated from different initial uniform noise images. Bottom: Two energy model metamers, generated from different initial uniform noise images. All four of the model metamers can be easily distinguished from the natural image at top (original vs. synthesized), but are difficult to distinguish from each other, despite the fact that their pooling windows have grown very large (synthesized vs. synthesized). Full resolution version of this figure can be found on the OSF.

The interaction between model sensitivities and image content affects performance

To the extent that the models capture something important about human perception, image pairs that are model metamers will be perceptual metamers, and hence discrimination should be at chance. Neither model offers predictions of perceptual discriminability (they are deterministic, and do not specify any method of decoding or comparing stimuli). Consistent with this, the critical scaling, which measures the point at which image pairs become indistinguishable, does not vary much across images for a given model and comparison, unlike performance at super-threshold scaling values and the asymptotic levels of d′ (figure 6B). Variations in max d′ are especially clear in the image-specific psychometric functions for the original vs. synthesized energy model comparison (figure 8). Specifically, for the llama image, performance only rises slightly above chance, even at very large scaling windows. The respective target images in panel B suggest an explanation: much of the llama image is cloud-like pink noise, while the nyc image is full of sharp edges in the cardinal directions, which arise from precise alignment of phases across positions and scales. As discussed above, synthetic energy model metamers have matching local oriented spectral energy, with randomized phase information; in order to generate sharp, elongated contours for the buildings of nyc, the windows must be very small. Conversely, the appearance of the llama is captured even when the pooling windows are large. Thus, when scaling is larger than critical scaling, some comparisons become easy and some do not. However, this pattern depends on the interaction between the model’s sensitivities and the target image content; see appendix 6 for more details.

The interaction between image content and model sensitivities greatly affects asymptotic performance, most noticeably on the original vs. synthesized comparison for the energy model, while critical scaling does not vary as much. (A) Performance for each image, averaged across subjects, comparing synthesized images to natural images. Most images show similar performance, with one obvious outlier whose performance never rises above 60%. Data points represent the average across subjects, 288 trials per data point for half the images, 144 per data point for the other half. Lines represent the posterior predictive means across subjects, with the shaded region giving the 95% HDI. (B) Example model metamers for two extreme images. The top row (nyc) is the image with the best performance (purple line in panel A), while the bottom row (llama) has the worst performance (red line in panel A). In each row, the leftmost image is the target image, and the next two show model metamers with the lowest and highest tested scaling values for this comparison. Performance on the llama image is poor because much of the image content resembles pink noise. Thus, even with larger scaling values, the model metamers are very difficult to distinguish from the target image. The nyc image, on the other hand, contains hard edges with precise alignment of phase across scales. As the energy model discards phase information, this phase structure is lost in the model metamers, which are consequently easy to distinguish from the target image at all tested scaling values. However, this pattern does not hold in the luminance model, or for synthesized vs. synthesized comparisons, for which both images exhibit typical performance (see appendix figure 6). A full resolution version of this figure can be found on the OSF.

When discriminating two synthesized images, the initial image affects performance

The two types of comparisons shown in figure 6 — original vs. synthesized and synthesized vs. synthesized — show very different critical scaling values. This indicates that for a particular scaling value and set of image statistics, some image pairs are much easier to discriminate than others. We hypothesize that metamers synthesized from white noise seeds are restricted to a relatively small region of the full set of model metamers. As a result, these images are more perceptually similar to each other than they are to the target image. To generate metamers outside of this set, we also used other natural images from our data set to initialize the synthesis procedure (which was not done in previous studies, Freeman and Simoncelli (2011); Wallis et al. (2019)).

Figure 9A shows behavior for these additional comparisons in a single subject, sub-00. Changing the initialization image has a large effect on the synthesized vs. synthesized comparison but little-to-no effect on the original vs. synthesized comparison. For synthesized vs. synthesized, initializing with a different natural image improves performance compared to initializing with white noise, but is still worse than performance for original vs. synthesized. Importantly, these results cannot be predicted from the model, which gives no specific insight as to why some pairs are more discriminable than others.

Initializing model metamers with natural images does not affect performance in the original vs. synthesized comparison, but reduces critical scaling and increases max d′ for the synthesized vs. synthesized comparison. Note that all data in this figure are from a single subject and 15 of the 20 target images. (A) Probability correct for one subject (sub-00), as a function of scaling. Each point represents the average of 540 trials (over all fifteen images), except for the synthesized vs. synthesized luminance model white noise comparison (averaged over 5 images). The vertical black line indicates a scaling value for which performance ranges from chance to near 100%, depending on initialization and comparison, as illustrated in panel B. (B) Three comparisons corresponding to the three psychophysical curves intersected by the vertical black line in panel A. See text for details. Full resolution version of this figure can be found on the OSF.

Figure 9B shows three comparisons involving five metamers arising from different initializations, each with scaling corresponding to the vertical line in panel A, but with dramatically different human performance. The top row shows the easiest comparison, between the original image and a synthesized image initialized with a different natural image (bike); the subject was able to distinguish these two images with near-perfect accuracy (in this case, when comparing against a natural image, performance is identical regardless of whether the metamer was initialized with white noise or a natural image). The bottom row shows the hardest comparison, between two synthesized images initialized with different samples of white noise. As discussed above, comparing two images of this type is difficult even with large pooling windows; at this scaling level, humans are insensitive to the differences between them, and so performance was at chance. The middle row shows two synthesized images, initialized with different natural images, which the subject was able to distinguish with moderate accuracy. When comparing these two images, one can see features in the periphery that remain from the initial image (tiles and highway, respectively). Even when fixating centrally, the subject was able to use these features to distinguish the two images, i.e., the human was sensitive to them while the model was not. This reinforces the notion that the initialization of the synthesis process matters. In both the middle and bottom rows, both images are synthesized (i.e., neither row contains the target image), yet one comparison is much harder than the other.

Parameter values for the comparisons shown in figure 9A (Top: critical scaling value; Bottom: max d′). Data shown are from the single subject who completed all comparisons. Points represent the posterior means, shaded regions the 95% HDI, and horizontal dashed lines and shaded regions the average across all shown images for this subject. Note that the luminance model, synthesized vs. synthesized: white noise comparison is not shown in this panel, because the data were poorly fit by this curve.

The models reach critical scaling at different dimensionality

For each model, the number of statistics is proportional to the number of pooling regions, which is inversely proportional to the square of the scaling value. Table 1 shows average critical scaling values across all conditions, along with the corresponding number of model statistics. We can see that critical scaling does not correspond to a fixed number of statistics. We should also note that if one were to use the model outputs as a compressed representation of the image, the number of statistics in each representation is almost certainly an overcount, for several reasons. First, in order to ensure that the Gaussian pooling windows uniformly tile the image, the most peripheral windows in the model have the majority of their mass off the image, which is necessary to avoid synthesis artifacts. Second, for the energy model, we did not attempt to determine how the precise number of scales or orientations affected metamer synthesis, and currently all scales are equally weighted across the image. As the human visual system is insensitive to high frequencies in the periphery and low frequencies in the fovea, this is probably unnecessary, and some of these statistics can likely be discarded. Finally, our pooling windows are highly overlapping and thus the pooled statistics are far from independent; this redundancy means that the effective dimensionality of our model representations is lower than the quoted number of statistics.
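A back-of-the-envelope count makes the quadratic dependence concrete. In a log-polar grid, window spacing is proportional to the scaling value in both log-eccentricity and polar angle, so the number of windows grows as 1/scaling². The constants below (foveal and maximal eccentricity, aspect ratio) are illustrative placeholders rather than our exact layout:

```python
import numpy as np

def approx_n_statistics(scaling, stats_per_window, min_ecc=0.5, max_ecc=26.8,
                        aspect=2.0):
    """Rough count of pooled statistics for a log-polar window grid."""
    n_bands = np.log(max_ecc / min_ecc) / scaling     # eccentricity bands
    n_angles = 2 * np.pi * aspect / scaling           # windows per band
    return int(n_bands * n_angles * stats_per_window)

n_pixels = 2048 * 2600
for model, s_c, n_stats in [("luminance", 0.016, 1), ("energy", 0.06, 25)]:
    n = approx_n_statistics(s_c, n_stats)
    print(f"{model}: ~{n:,} statistics, {100 * n / n_pixels:.0f}% of pixels")
```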

Critical scaling (posterior mean over all subjects and images) and number of statistics (as a percentage of the number of input image pixels — a type of “compression rate”), for each model and comparison. Note that the critical scaling for original vs. synthesized comparisons does not result in the same number of statistics across models and, in particular, at their critical scaling values, all models have dimensionality less than that of the input image.

Discussion

We measured perceptual discriminability of wide field-of-view metamers of two foveated pooling models of human vision. We found that performance depended on the model type (i.e., the type of image statistics that are pooled), the nature of the comparison (original vs. synthesized, or synthesized vs. synthesized), the seed image used for synthesis, and to a modest extent, the natural image target. Specifically, we found that critical scaling was much smaller for the luminance than for the energy model, and much smaller for original vs. synthesized than for synthesized vs. synthesized comparisons. For the latter comparison, we also found smaller critical scaling values when the synthesis was initialized from another natural image rather than white noise.

The linking proposition underlying the metamer paradigm

In Freeman and Simoncelli (2011), the authors propose a link between perceptual pooling models of the type found in this study and visual system physiology. They hypothesize that the critical scaling reflects the receptive field sizes of visual areas representing the corresponding statistics. Using this logic, they found a critical scaling for their texture model of approximately 0.5, and interpreted this as evidence of texture representation in V2, which has receptive field sizes with similar scaling.

This logic follows the “Converse Identity” linking proposition in the framework proposed by Teller (1984): identical perceptual states imply identical physiological states (at some stage of visual processing, and thereafter). Model-generated metamers provide an accessible experimental extension of this logic: at the critical scaling value, identical model outputs imply identical perceptual states, which imply identical physiological states. However, when attempting to link the critical scaling of a pooling model to the physiological scaling of a corresponding brain area, it is important to remember that receptive field size is a function not just of the visual area, but also of at least the cell class, cortical layer, mapping method, and stimulus type. Therefore, a visual area cannot be fully characterized by a single scaling value. This complicates the task of linking the psychophysical scaling value to the neural scaling value at a particular stage of visual processing. Hence we focus here on how the results vary with the type of model and type of comparison, and do not seek to unambiguously assign a model to a stage of visual processing.

Visual computation: cascades of feature extraction and local pooling

Visual representations are formed through a cascade of transformations. An appealing hypothesis is that each of these is comprised of the same canonical “feature extraction and blur” computation, differing only in what is extracted and the spatial extent of the blurring (Fukushima, 1980; Douglas et al., 1989; LeCun et al., 1989; Heeger et al., 1996; Riesenhuber and Poggio, 1999; Bruna and Mallat, 2013). In the perception literature, Lettvin (1976) provides an early, informal discussion of this “compulsory feature averaging”; non-foveated versions have been described in Parkes et al. (2001); Pelli et al. (2004); Greenwood et al. (2009), and foveated proposals appear in Balas et al. (2009) and Freeman and Simoncelli (2011). In these, as in the current article, the stages are distinguished by their features, and the scaling of the pooling regions with eccentricity. The optics and photoreceptors pool the incident light over small regions, V1 pools spectral energy over larger regions, V2 pools texture-like statistics over yet larger regions, and so forth. Consistent with this view, we find that for synthetic metamer stimuli, the pooled model statistic has a dramatic effect on the critical scaling value, which is approximately four times larger for the energy model than for the luminance model in the original vs. synthesized comparison (figure 6 and table 1). This result is consistent across all subjects and all target images (figure 6B). The relationship between critical scaling and image statistic can be further appreciated by comparing these results to the model of Wallis et al. (2019), which was also fit to original vs. synthesized discrimination data: the critical scaling of our average energy model is about three times smaller than the average value for their texture model. Together, the three results show critical scaling ratios of approximately 1:4:12 for features corresponding to luminance:spectral-energy:texture, respectively (solid circles in figure 11).

Critical scaling values for the two pooling models presented in this paper (Luminance and Energy) and the Texture model (originally tested in Freeman and Simoncelli (2011), data from Wallis et al. (2019), averaging across the two image classes). Solid points indicate the original vs. synthesized white noise comparisons, while hollow points indicate synthesized vs. synthesized white noise comparisons (for the luminance model, this is effectively infinite, since participants were unable to complete the task). For all three models, critical scaling values are smaller in the original vs. synthesized comparison than in the synthesized vs. synthesized one, and their ratio decreases with increasing complexity of image statistics. A potential explanation for this is that the more complex models approximate computations performed in deeper levels of the visual hierarchy, beyond which there are fewer remaining stages to discard information.

Why does critical scaling depend on the comparison being performed?

We found large effects of comparison type on performance, consistent with those reported in Wallis et al. (2019) and Deza et al. (2019). For the energy model, with synthesis initialized from white noise, the critical scaling for synthesized vs. synthesized was about four times larger than that for original vs. synthesized. The maximum discriminability was also much lower for the former than the latter. For the luminance model, the difference was even more dramatic: participants were generally unable to discriminate any pairs for the synthesized vs. synthesized comparison. Differences between the two types of comparisons, which are summarized in figure 11, arise from an interaction between human perception, the model, and the synthesis process. Here we consider two abstract scenarios, one in which the comparison does matter and one in which it does not, in order to illustrate this interaction.

In the left panel of figure 12, the comparison does not matter: model metamers are distinguishable from each other if and only if they are distinguishable from the target image. The right panel illustrates a configuration in which this condition does not hold: two synthetic images can be indistinguishable, even though each is distinguishable from the target image. The idealized version implicitly assumes that the synthesized images sample the manifold of possible model metamers broadly, but our synthesis procedure (similar to those of Freeman and Simoncelli (2011) and Wallis et al. (2019)) does not guarantee this. Initialization with natural images provides an intuitive, but ad-hoc, method of sampling from a broader portion of the manifold of possible model metamers, and results in smaller critical scaling values. More principled statistical sampling approaches (e.g., Markov Chain Monte Carlo) could result in more representative metamers.

In the idealized version of the metamer paradigm, synthesized images broadly sample the space of possible model metamers. Thus, at large scaling values, synthesized images are distinguishable from each other and the target image, and they become indistinguishable from each other at the same scaling value for which they are indistinguishable from the target (left panel). In this case, model metamers are metameric with each other if and only if they are metameric with the target image. In our experiments, however, this is not the case: at some scaling values, two synthesized images can be metameric with each other, but distinguishable from the target image (right panel).

Why does this critical scaling discrepancy decrease with increasing feature complexity?

As seen in figure 11, the difference in critical scaling between the two types of comparisons declines as the image statistics being pooled become more complex. While the original vs. synthesized critical scaling value is lower for all three models, the gap between the two decreases as the models increase in complexity: infinite for the luminance model, roughly quadruple for energy, and less than double for texture (see figure 11).

One potential explanation for this observation is that these computations are being performed in deeper stages of the visual hierarchy and there are progressively fewer opportunities to discard information later in the hierarchy. For example, the difference in V1 responses for a pair of images may be discarded by a later stage (e.g., area IT), but there are not many steps of processing between IT and the perceptual readout where differences between IT responses can be discarded. This may explain why we see no overlap between the critical scaling values for original vs. synthesized and synthesized vs. synthesized comparisons across images in figure 6B, whereas Wallis et al. (2019) find substantial overlap for the texture model. Ultimately, these possibilities can only be distinguished through further physiological measurements.

Another potential explanation is schematized in figure 13: for the luminance model, the model metamer classes are aligned with the perceptual metamer classes and the white noise samples such that, for a given scaling value, all model metamers initialized with white noise fall into the same perceptual metamer class. This is an extreme case of the synthesis issue described in the previous section. On the other hand, the texture model’s metamer classes are orthogonal to the white noise samples, so that synthesized model metamers easily fall into distinct perceptual metamer classes, leading to a critical scaling value that is much more similar to the value for the original vs. synth comparison.

The difference between the synth vs. synth and original vs. synth white noise comparisons decreases as model complexity increases (compare with figure 12). Because of the alignment between the luminance model metamer classes, the underlying perceptual metamer classes, and the white noise samples used to initialize the synthesis process, synthesized images always lie within the same perceptual metamer class and are thus never distinguishable. On the other hand, the texture model’s metamer classes grow orthogonally to the white noise samples, resulting in a much smaller critical scaling value for the synth vs. synth white noise comparison.

Why and when does the critical scaling depend on synthesis initialization?

The previous sections show that the critical scaling depends on the comparison type and feature complexity. We also find that it depends on synthesis initialization. Natural images are more likely than synthetic images to include information that the human visual system has evolved to discriminate, as opposed to information that is discarded at some later stage of processing. Using natural images to initialize synthesis (or developing novel synthesis methods to better explore the space of metamers) may reduce the discrepancy between the two conditions.

The schematic presented in figure 14 provides a potential explanation for why critical scaling values depend on the comparison and synthesis initialization. Ideally, synthesized metamers are discriminable from each other if and only if they are also discriminable from the target, as described in section Why does critical scaling depend on the comparison being performed?. The variability in critical scaling across comparisons indicates that this condition is being violated: the higher critical scaling value for the synthesized vs. synthesized condition indicates that the synthesized images used in those comparisons fall into the same metamer class. This effect is reduced when initializing with a natural image.

Schematics describing results presented in this paper. Unlike the idealized metamer paradigm (see section Why does critical scaling depend on the comparison being performed? and figure 12A), the critical scaling value for our models depends on the comparison being performed and, to a lesser extent, the image used to initialize synthesis. For the original vs. synthesized comparison, this does not affect the critical scaling value, as the synthesized images are always distinct from the target. However, initializing with white noise can result in synthesized images that lie in the same metamer class as each other even while they are distinct from the target, resulting in a relatively large scaling value for the synthesized vs. synthesized comparison. Initializing with natural images reduces the magnitude of this phenomenon.

This suggests that the noise-initialized algorithm produces a biased sampling of the space of all possible model metamers. For each target image, model, scaling value, and initialization condition (white noise or natural image), we generated three model metamers. If these were distributed broadly across the space of all possible model metamers, we would not see the dependence on initialization depicted in figure 14. Initializing with a natural image was one attempt to sample a different portion of this space, and it did reduce the discrepancy between comparison conditions. However, further work needs to be done to better understand the effects of initialization on generated samples. One promising possibility would be to synthesize model metamers for a given target image, model, and scaling value in sets that are as different from each other as possible, with the differences quantified using other pooling models, other visual models, or image quality metrics. We believe the metamer paradigm can be made more informative by paying more attention to the synthesis procedure.

We believe a more extreme version of this situation is reflected in the data for the luminance model: the synth vs. synth comparison is never possible when initializing model metamers with white noise (so we cannot estimate the model’s critical scaling), but initializing with natural images does allow these model metamers to be distinguished from each other (see appendix 1). This is an extreme case of the biased sampling discussed above: the local luminance matching enforced by the luminance model is a fairly lax constraint, so initializing with white noise leads to model metamers that consistently lie within the same perceptual metamer class, while at the same time being very discriminable from the target image (see figure 7).

By gathering discriminability data for different types of comparisons, we are able to get a better sense of how the metamer classes of our pooling models align with those of the visual system. Our current data support the proposal from Wallis et al. (2019) that the critical scaling should be estimated using only the original vs. synthesized comparison, as those values are minimal across comparisons and do not seem to depend on how metamer synthesis is initialized. Furthermore, there are many applications where that is the only comparison that matters (e.g., when generating images that need only be indistinguishable from natural images). However, it is not clear that this will always be the case: although we never sampled from the metamer class containing the target image when the critical scaling value was large, it is theoretically possible, and additional comparisons provide protection against over-interpreting results that stem from biased sampling.

Asymptotic performance, but not critical scaling, depends on image content

Similar to Wallis et al. (2019) and Brown et al. (2021), we find that metamer discrimination performance is somewhat dependent on image content. Both of those studies synthesized model metamers based on pooled texture statistics; Wallis et al. (2019) show that texture-like original images are harder to distinguish from their synthesized images than scene-like ones, while Brown et al. (2021) show that original textures with high global and local regularity (e.g., woven baskets) are easier to distinguish from their synthesized images than those with low regularity (e.g., animal fur). This aligns with our result: the most distinguishable pairs include natural images with features not well-captured by the synthesizing model, whereas the least distinguishable include those natural images whose features are all adequately captured.

However, we should note that we found this image-level variability largely in super-threshold performance, and this variability does not constitute a failure of these pooling models. As pointed out by Freeman and Simoncelli (2011), asymptotic performance also varies with experimental manipulation, while critical scaling remains relatively unaffected. The metamer paradigm makes strong predictions about what happens when the representation of two stimuli are matched: they are indistinguishable, and so performance on a discrimination task will be at chance, as captured by the critical scaling value. However, it makes no predictions about performance at super-threshold levels, as captured by the max d′ parameter. An analogy with color vision seems apt: color matching experiments provide evidence for what spectral distributions of light are perceived as identical colors, but provide no information about whether humans consider blue more similar to green or to red; further investigations are necessary to understand color appearance. Thus, while this image-level variability is worth investigating in order to better understand the sensitivities of our model, it does not much affect the inferences we want to make about the human visual system, and speaks more to the need for a complementary approach (see next section).

Observer models are needed to predict discriminability of non-metameric images

As discussed above, the metamer paradigm’s converse identity linking proposition is silent on what is implied by distinct model outputs, and so a complementary approach is required, such as building observer models to predict perceptual distances. Specifically, the metamer paradigm makes no predictions about discriminability for model metamers synthesized with a super-critical scaling value. Such images produce distinct outputs in the synthesizing pooling model evaluated at its critical scaling. However, the information distinguishing them might be discarded by later stages of visual processing (center panel, figure 1). A complementary approach investigating how such differences are handled by later brain areas is necessary to gain a better understanding of image discriminability beyond “identical or not”. Such work could use the models presented here as a starting point and could draw on the substantial literature of observer models in vision science and image processing.

There are several important properties of the pooling models used in metamer studies that should be revisited if attempting to extend them into observer models. First, the models assume equal sensitivity to all statistics, and that the sensitivity to these statistics does not change across the visual field. Second, the models assume that every statistic is pooled in regions of equal size (e.g., that high frequency content is pooled over the same size region as low frequency content). Finally, the shape of the pooling windows (e.g., whether windows should be radially elongated with a 2:1 ratio) and their scaling with eccentricity could probably benefit from refinement. When performing the task (especially the original vs. synthesized comparisons), subjects reported that the most informative portions of the image were in the mid-periphery, rather than close to fixation or at the edges of the image. This suggests that the windows may be too large in the mid-periphery, and that window width may be better modeled as a non-linear function of eccentricity.

The investigation of super-threshold performance and appearance is a necessary complement to the metamer paradigm, which focuses on the question of whether two images are perceptually identical or not. Whereas observer models and image quality metrics often rely on natural images, common experimental stimuli, or sets of common distortions, the metamer paradigm relies on the synthesis of novel images, often turning up unexpected exemplars. Combining this synthesis-focused approach with attention to super-threshold performance would lead to a fuller understanding of human perceptual sensitivities and insensitivities.

Materials and Methods

All experimental materials, data, and code for this project are available online under the MIT or similarly permissive licenses. Specifically, software is on GitHub, synthesized metamers can be browsed on this website, and all images and data can be downloaded from the OSF. The GitHub site provides instructions for downloading and using data.

Synthesis

We synthesized model metamers matching 20 different natural images (the target images) from the authors’ (W.F.B and E.P.S) personal collections, as well as from the UPenn Natural Image Database (Tkačik et al., 2011) and from an unpublished collection by David Brainard. The selected photos were high-resolution, with 16-bit pixel intensities proportional to luminance, and had not undergone lossy compression (which could result in artifacts). They were converted to grayscale using scikit-image’s color.rgb2gray function (van der Walt et al., 2014), cropped to 2048 by 2600 pixels (the Brainard photos were 2014 pixels tall, so a small amount of reflection padding was used to reach 2048 pixels), and had their pixel values rescaled to lie between 0.05 and 0.95. Synthesized images were still allowed to have pixel values anywhere between 0 and 1; without rescaling the target images, synthesis resulted in strange artifacts around pixels near 0, the minimum allowed value. The images were chosen to span a variety of natural image content types, including buildings, animals, and natural textures (see figure 5).
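In outline, the preprocessing amounts to a few lines (a sketch: the file name is hypothetical, the center-crop location is our assumption, and the reflection padding for the 2014-pixel-tall Brainard photos would be handled analogously with np.pad):

```python
import numpy as np
from skimage import color, io

img = io.imread("photo.tiff")                # 16-bit RGB, linear in luminance
img = color.rgb2gray(img)                    # grayscale float, range [0, 1]
h, w = img.shape
img = img[(h - 2048) // 2:(h + 2048) // 2,   # crop to 2048 x 2600 pixels
          (w - 2600) // 2:(w + 2600) // 2]
# rescale pixel values to lie between 0.05 and 0.95
img = 0.05 + 0.9 * (img - img.min()) / (img.max() - img.min())
```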

We synthesized the model metamers using custom software written in PyTorch (Paszke et al., 2019), using the AMSGrad variant of the Adam optimization algorithm (Kingma and Ba, 2014; Reddi et al., 2018), with learning rate 0.01. Slightly different approaches were used for the luminance and energy model metamers. For the luminance model metamers, the objective function was the mean-squared error between the model representations M(·) of the target image T and the synthesized image Î,

ℒ(Î) = (1/N) Σⱼ (Mⱼ(T) − Mⱼ(Î))²,

where N is the number of pooled statistics, and synthesis was run for 5000 iterations. For the energy model metamers, the objective function also contained a quadratic range penalty term, which penalized any pixel values outside of [0, 1],

ℒ(Î) = (1/N) Σⱼ (Mⱼ(T) − Mⱼ(Î))² + λ ℬ(Î), where

ℬ(Î) = Σᵢ [max(Îᵢ − 1, 0)² + max(−Îᵢ, 0)²],

with the sum taken over all pixels i and λ a fixed penalty weight.

Synthesis was run for 15,000 iterations. Additionally, energy model metamer synthesis used stochastic weight averaging (Izmailov et al., 2018), which helped avoid local optima by averaging over pixel values as synthesis neared convergence, as well as coarse-to-fine optimization (Portilla and Simoncelli, 2000). Each statistic (in both models) was z-scored using the average value of that statistic computed across the entire image on a selection of grayscale texture images. For both models, synthesis terminated early if the loss had not decreased by more than $10^{-9}$ over the previous 50 iterations. Not all model metamers reached the same final loss, with synthesis loss differing across target images, but there was no relationship between the remaining loss and behavioral performance.
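The synthesis loop can be sketched as follows. This is a minimal illustration, assuming a function model that maps an image tensor to its vector of pooled statistics; the penalty weight and function names are placeholders, not the actual implementation:

    import torch

    def synthesize(model, target, n_iter=15000, lr=0.01, penalty_weight=0.1):
        # match the model representation of the target image
        target_stats = model(target).detach()
        # initialize with white noise, each pixel uniform on [0, 1]
        synth = torch.rand_like(target, requires_grad=True)
        opt = torch.optim.Adam([synth], lr=lr, amsgrad=True)  # AMSGrad variant
        for _ in range(n_iter):
            opt.zero_grad()
            loss = torch.mean((model(synth) - target_stats) ** 2)
            # quadratic penalty on pixel values outside [0, 1]
            loss = loss + penalty_weight * (
                synth.clamp(max=0).pow(2).sum()
                + (synth - 1).clamp(min=0).pow(2).sum())
            loss.backward()
            opt.step()
        return synth.detach()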

For each model, the windows were represented as two tensors, one for angular slices and one for annuli, which, when multiplied together, give the individual windows (with a separate set of windows for each scale in the energy model). This representation required a large amount of memory: for scaling values below 0.09, the models were too large to perform synthesis on the available NVIDIA GPUs with 32GB of memory. Thus, all luminance model metamers were computed on the CPU, and synthesis of a single image took from about an hour for scaling 1.5, to 2 days for scaling 0.058, to 14 days for scaling 0.01. For the energy model metamers, the two lowest scaling values were computed on the CPU, with synthesis taking about a week. For the energy model metamers that could be computed on the GPU, synthesis took from 5 hours for scaling 0.095 down to 1.5 hours for scaling 0.27 and above. Synthesis was run in parallel on the high-performance computing cluster at the Flatiron Institute.
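To illustrate why this separable representation matters for memory, consider the following sketch (window counts are hypothetical): materializing every window as a full-resolution map would require n_angles times n_ecc such maps, whereas contracting the two factors directly against the image avoids ever forming them.

    import torch

    n_angles, n_ecc, H, W = 64, 24, 2048, 2600   # hypothetical window counts
    angular = torch.rand(n_angles, H, W)         # angular slice functions
    annuli = torch.rand(n_ecc, H, W)             # annular (radial) functions
    image = torch.rand(H, W)
    # the pooled value for window (a, e) is the pixel-wise sum of
    # angular[a] * annuli[e] * image, computed here without materializing
    # the n_angles * n_ecc full-resolution windows:
    pooled = torch.einsum('ahw,ehw,hw->ae', angular, annuli, image)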

Synthesized images for original vs. synthesized and synthesized vs. synthesized white noise comparisons (see Psychophysical experiment) were initialized with full-field patches of white noise (each pixel sampled from a uniform distribution between 0 and 1). For each model, scaling value, and target image, three different initialization seeds were used. A unique set of three seeds was used for each scaling value and target image, except for the following, which all used seeds {0, 1, 2}:

  • Luminance model: azulejos, bike, graffiti, llama, terraces, tiles; scaling 0.01, 0.013, 0.017, 0.021, 0.027, 0.035, 0.045, 0.058, 0.075, and 0.5.

  • Energy model: azulejos, bike, graffiti, llama, terraces, tiles; scaling 0.095, 0.12, 0.14, 0.18, 0.22, 0.27, 0.33, 0.4, and 0.5.

For the original vs. synthesized and synthesized vs. synthesized natural image comparisons, the synthesized images for each model, scaling value, and target image were initialized with three random choices from among the remaining target images.

Pooling windows

Pooling model windows are laid out on a log-polar grid, with peaks spaced one standard deviation apart, such that adjacent window functions cross at a value of 0.352 (relative to a max of 0.4). The models have a single parameter, scaling, which specifies the ratio between a pooling window's full-width half-max (FWHM) diameter in the radial direction and the distance of its center from the fovea, both in degrees. For example, the pooling windows of a model with scaling factor 0.1 have a radial diameter of 1 degree at 10 degrees eccentricity, 2 degrees at 20 degrees, etc.
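In code, this definition amounts to the following trivial relationship (an illustration of the stated definition, not the actual window-construction code):

    def radial_fwhm_diameter(scaling, eccentricity_deg):
        """FWHM diameter (deg) of a pooling window centered at a given
        eccentricity, per the definition of the scaling parameter above."""
        return scaling * eccentricity_deg

    assert radial_fwhm_diameter(0.1, 10) == 1.0   # 1 deg diameter at 10 deg
    assert radial_fwhm_diameter(0.1, 20) == 2.0   # 2 deg diameter at 20 deg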

Image pixels within 0.5 degrees of the fixation point were exactly matched in our synthesized images, approximating the fovea, where no pooling occurs. Additionally, for small scaling values, windows for some distance beyond this region would be smaller than a pixel, so the only way to satisfy the model is to match the pixel values in that region directly. For example, with an image resolution of 2048 by 2600 pixels and a display size of 53.6 by 42.2 degrees, models with a scaling value of 0.063 have windows whose diameter at FWHM is smaller than a pixel out to 0.52 degrees, with this distance growing in inverse proportion to scaling, reaching 3.29 degrees for scaling 0.01 (see appendix section 5 for more discussion).

Observers

Eight participants (5 women and 3 men, aged 24 to 33), including an author (W.F.B.), participated in the study and were recruited from New York University. All subjects had normal or corrected-to-normal vision. Each subject completed nine one-hour sessions. One subject (sub-00) also performed seven additional sessions. All subjects provided informed consent before participating in the study. The experiment was conducted in accordance with the Declaration of Helsinki and was approved by the New York University ethics committee on activities involving human subjects.

Psychophysical experiment

A psychophysical experiment was run in order to determine which of the synthesized model metamers were also perceptual metamers. We first describe the structure of a single trial, then how the trials were organized into blocks and sessions.

Trial structure

See figure 15 for a schematic. Observers viewed a series of grayscale images on a monitor, at a size of 53.6 by 42.2 degrees. An initial image, divided in half by a vertical midgray bar 2 degrees wide, was displayed for 200 msec, then replaced by a midgray screen for 500 msec, followed by a second image (also divided by a vertical midgray bar) for another 200 msec. Images were presented for only 200 msec to minimize the possibility of eye movements. The dividing bar prevented participants from using discontinuities between the two image halves to perform the task. One side of the second image (left half or right half) was identical to the first image, and the other side changed. After the second image, a midgray screen appeared with text prompting a response, and the observer's task was to report which half had changed. Observers had as much time as needed to respond. The two compared images were either two synthesized images (synthesized by identical models with the same scaling value and target image, but different initializations) or one synthesized image and its target image. Either image could be presented first.

Schematic of psychophysics task. Top shows the structure for a single trial: a single image is presented for 200 msec, bisected in the center by a gray bar, followed by a blank screen for 500 msec. The image is re-displayed, with a random half of the image changed to the comparison image, for 200 msec. The participants then have as long as needed to say which half of the image changed, followed by a 500 msec intertrial interval. Bottom table shows possible comparisons. In original vs. synthesized, one image was the target image whose model representation the synthesized images match (see figure 5), and the other was one such synthesized image. In synthesized vs. synthesized, both were synthesized images targeting the same original image, with the same model scaling, but different initialization. In experiments, dividing bar, blanks, and backgrounds were all midgray. For more details see text.

The midgray blank screen presented between the two image presentations reduces motion cues participants could use to discriminate the two images. Our models aim to capture the steady state response to the images, not the transient response. The mask forces the participants to use the image content to discriminate between the two images, rather than relying on temporal edges (analogous to our use of the vertical bar to prevent the use of spatial edges). This introduces a short-term memory component in the task (participants must remember the first image in order to compare it to the second image), as in previous metamer discrimination experiments (Freeman and Simoncelli, 2011; Deza et al., 2019; Wallis et al., 2019). We believe the precise duration of this mask is unimportant for our results: first, Bennett and Cortese (1996) found the duration of a blank screen did not affect thresholds in a spatial frequency discrimination task over a range from 200 to 10,000 msec, and second, mask duration is likely to have a similar effect on performance as image presentation duration, which Freeman and Simoncelli (2011) found affected asymptotic performance but not critical scaling.
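A minimal PsychoPy sketch of this presentation sequence follows. The stimulus file names and object parameters are placeholders, and monitor calibration details are omitted; this is not the authors' experiment code, which is available on GitHub:

    from psychopy import visual, core, event

    win = visual.Window(fullscr=True, units='deg',
                        color=(0, 0, 0))  # (0, 0, 0) is midgray in rgb space
    first = visual.ImageStim(win, image='first.png')    # placeholder images
    second = visual.ImageStim(win, image='second.png')
    bar = visual.Rect(win, width=2, height=42.2,
                      fillColor=(0, 0, 0))              # midgray divider

    # 200 msec image, 500 msec blank, 200 msec image
    for stim, duration in [(first, 0.2), (None, 0.5), (second, 0.2)]:
        if stim is not None:
            stim.draw()
            bar.draw()
        win.flip()
        core.wait(duration)
    win.flip()                                          # blank until response
    keys = event.waitKeys(keyList=['left', 'right'])    # untimed response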

Session and block organization

Across 9 sessions, each subject completed a total of 12,960 trials: 3 model/comparison combinations by 8 scaling values by 15 target images by 36 repetitions. The 3 model/comparison combinations were 1) luminance model, original vs. synth, white noise; 2) energy model, original vs. synth, white noise; and 3) energy model, synth vs. synth, white noise. The 8 scaling values were logarithmically spaced, with the range chosen separately for each model/comparison to span an appropriate range of values. For luminance model, original vs. synth, white noise, the scaling endpoints were 0.01 and 0.058; for energy model, original vs. synth, white noise, they were 0.063 and 0.27; and for energy model, synth vs. synth, white noise, they were 0.27 and 1.5. There were a total of 20 target images, but each subject saw only 15. Every subject saw images 1 through 10; half the subjects also saw images 11 through 15, and half saw images 16 through 20 (see figure 5). The 36 repetitions, which were averaged for analysis, comprised 12 trials for each of 3 synthesis seeds. For the white noise-initialized comparisons, these seeds were independent samples of white noise used to initialize the synthesis procedure, resulting in 3 distinct model metamers.

Each of the above model/comparisons was tested across 3 sessions, each lasting approximately one hour. Each subject started with either the luminance or energy model, original vs. synth, white noise; the 3 sessions for the first model/comparison were completed before moving on to the 3 sessions testing the other model. The order of the two models was randomized across subjects. After completing these 6 sessions, the subject completed 3 sessions testing the energy model, synth vs. synth, white noise. This comparison was run last because it was the most difficult.

Each of the 9 sessions consisted of 1,440 trials, containing all 36 repetitions for all 8 scaling values for 5 of the 15 target images viewed by the subject (target images were randomly assigned to sessions, independently for each subject). The 1,440 trials per session were broken up into 5 blocks of 288 trials each. Each block took about 8 to 12 minutes, and consisted of 12 repetitions for all 8 scaling values for 3 of the 5 target images.
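These counts factor as follows (a quick bookkeeping check of the design described above, not analysis code):

    # 3 model/comparisons x 8 scalings x 15 images x 36 repetitions per subject
    assert 3 * 8 * 15 * 36 == 12960
    assert 9 * 1440 == 12960    # 9 sessions of 1,440 trials each
    assert 5 * 288 == 1440      # 5 blocks of 288 trials per session
    assert 12 * 8 * 3 == 288    # 12 repetitions x 8 scalings x 3 images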

In addition, one subject (sub-00) completed 7 additional sessions (10,080 additional trials): 1 session for luminance model, synth vs. synth, white noise; 3 for energy model, original vs. synth, natural image; and 3 for energy model, synth vs. synth, natural image. As with the comparisons that all subjects completed, these sessions each included 1,440 trials, factored into 5 target images by 8 scaling values by 36 repetitions. Only 1 session was included for luminance model, synth vs. synth, white noise because performance was at chance for all images and all scaling values (see figures 6 and 7). No sessions were completed for the luminance model, natural image comparisons due to the time required for synthesis; see appendix section 1 for more information.

The four types of comparisons are explained in full below:

  1. Original vs. synthesized, white noise: the two images being compared were always one synthesized image and its target image, and the synthesized image was initialized with a sample of white noise.

  2. Synthesized vs. synthesized, white noise: both images were synthesized, with the same model, scaling value, and target image, but different white noise seeds as synthesis initialization.

  3. Original vs. synthesized, natural image: the two images being compared were always one synthesized image and its target image, and the synthesized image was initialized with a different natural image drawn randomly from our set.

  4. Synthesized vs. synthesized, natural image: both images were synthesized, with the same model, scaling value, and target image, but initialized with different natural images from our set.

Subjects completed several training blocks. Before their first session, they completed an initial training block comparing two natural images and two noise samples (one white, one pink). Before their first session of each comparison type that included a natural image, they completed a secondary training block showing two natural images and two synthesized images of the type included in the session, one with the largest scaling value included in the task and one with the smallest. Before their first session comparing two synthesized images, they similarly completed a training block comparing four synthesized images (two with a low scaling value and two with a high scaling value) for each of two target images. Each training block took one to two minutes and was repeated if performance on the high-scaling synthesized images was below 90% or if subjects expressed uncertainty about their ability to perform the task (participants were expected to perform close to chance on the low-scaling synthesized images). Additionally, before each session that included a natural image (the original vs. synthesized comparisons), subjects were shown the five natural images that would be part of that session, as well as two example synthesized images per target image, one with a low scaling value and one with a high scaling value. Before each session comparing two synthesized images (the synthesized vs. synthesized comparisons), subjects were shown four example synthesized images per target image, two with the lowest and two with the highest scaling value for that comparison. A video of a single energy model training block (original vs. synthesized: white noise comparison) can be found on the OSF.

Apparatus

The stimuli were displayed on an Eizo CS2740 LED flat monitor running at 60 Hz with a resolution of 3840×2160. The monitor was gamma-corrected to yield a linear relationship between luminance and pixel value. The maximum, minimum, and mean luminances were 147.73, 0.3939, and 77.31 cd/m², respectively.

The experiment was run with a viewing distance of 40 cm, giving 48.5 pixels per degree of visual angle. A chin and forehead rest was used to maintain head position, but the subjects’ eyes were not tracked.

The experiment was run using custom code written in Python 3.7.0 using PsychoPy 3.1.5 (Peirce et al., 2019), run on an Ubuntu 20.04 LTS desktop. A button box was used to record the psychophysical response data. All stimuli were presented as 8-bit grayscale images.

Data analysis

All trials were analyzed: a total of 4,320 trials per subject per model per comparison (across 15 images and 8 scaling values) for all energy model comparisons and for the luminance model original vs. synthesized, white noise comparison. The luminance model synthesized vs. synthesized, white noise comparison had 1,440 trials (across 5 images and 8 scaling values) for a single subject. Where behavioral data are plotted in this paper, the proportion correct is the average across all relevant trials.

We fit psychophysical curves describing proportion correct as a function of model scaling using the two-parameter function for discriminability d′ derived in Freeman and Simoncelli (2011):

$$d'(s) = \begin{cases} \alpha \sqrt{1 - \dfrac{s_c^2}{s^2}}, & s > s_c \\ 0, & s \le s_c \end{cases}$$

where $s_c$ is the critical scaling value (performance is at chance for scaling values at or below $s_c$) and α is the max d′ value (called the “proportionality factor” in Freeman and Simoncelli (2011)).

Psychophysical curves were constructed by converting this d′ into the probability correct using the same function as in Freeman and Simoncelli (2011):

$$P(s) = \Phi\!\left(\frac{d'(s)}{\sqrt{2}}\right) \Phi\!\left(\frac{d'(s)}{2}\right) + \Phi\!\left(\frac{-d'(s)}{\sqrt{2}}\right) \Phi\!\left(\frac{-d'(s)}{2}\right)$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution. The probability correct is 50% when d′ = 0 (and thus when scaling is at or below the critical scaling), about 79% when d′ = 2, and about 98% when d′ = 4. As the α parameter gives the maximum d′ value, it has a monotonic relationship with asymptotic performance, as shown in figure 16.
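The following sketch implements these two functions and reproduces the quoted values (scipy's norm.cdf stands in for Φ; the piecewise form of d′ is as reconstructed above):

    import numpy as np
    from scipy.stats import norm

    def d_prime(s, s_c, alpha):
        # zero at or below the critical scaling, saturating at alpha
        return alpha * np.sqrt(np.clip(1 - (s_c / s) ** 2, 0, None))

    def prob_correct(d):
        # proportion correct as a function of d'
        return (norm.cdf(d / np.sqrt(2)) * norm.cdf(d / 2)
                + norm.cdf(-d / np.sqrt(2)) * norm.cdf(-d / 2))

    print(prob_correct(0.0))   # 0.5 (chance)
    print(prob_correct(2.0))   # ~0.79
    print(prob_correct(4.0))   # ~0.98
    # e.g., a curve with s_c = 0.25 and alpha = 4, evaluated at scaling 0.5:
    print(prob_correct(d_prime(0.5, 0.25, 4.0)))   # ~0.95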

Relationship between the max d′ parameter α and asymptotic performance. As max d′ increases beyond approximately 5 (where asymptotic performance is at ceiling), the slope of the psychophysical curve continues to increase (for example, compare the slopes of the luminance and energy model original vs. synth white noise comparisons in figure 9A).

The posterior distribution over the parameters $s_c$ and α was estimated using a hierarchical, partial-pooling model, with independent subject- and image-level effects for both $s_c$ and α, and with each model and comparison estimated separately, following the procedure used in Wallis et al. (2019). Subject responses were modeled as samples from a Bernoulli distribution with probability (1 − π)P(s) + 0.5π, where π is the lapse rate, estimated independently for each subject. Estimates were obtained using a Markov Chain Monte Carlo (MCMC) procedure written in Python 3.7.10 (Van Rossum and Drake, 2009) using the numpyro package, version 0.8.0 (Phan et al., 2019; Bingham et al., 2018). MCMC sampling was conducted using the No-U-Turn Sampler algorithm (Hoffman and Gelman, 2014), with parameters selected to ensure convergence, which was assessed using the $\hat{R}$ statistic (Brooks and Gelman, 1998; looking for $\hat{R} < 1.01$, per Vehtari et al., 2021) and by examining traceplots.

Parameters were given weakly-informative priors, and both $s_c$ and α were estimated on natural logarithmic scales.

In sum, for model m ∈ {E, L}, comparison t, subject x, image i, and scaling s:

with the following priors:

The priors for $s_{c,mt}$ of the energy and luminance models correspond to critical scaling values of 0.25 and 0.018, respectively: the former is the center of the V1 physiological range provided in Freeman and Simoncelli (2011) figure 5, and the latter is derived from the slope of a line fit to the dendritic field diameter vs. eccentricity of midget retinal ganglion cells in Dacey and Petersen (1992) figure 2B (see appendix section 2). This captures our prediction that the models' critical scaling values should be similar to the physiological scaling of the brain area sensitive to the same image features, should be independent of comparison type, and should be consistent across images and subjects, while not placing too strong a constraint on the parameters.
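To make this structure concrete, here is a minimal numpyro sketch of the response model for a single model/comparison. The prior widths, the Beta prior on the lapse rate, and the additive log-scale structure of the subject- and image-level effects are illustrative assumptions, not the authors' exact specification (which is in the GitHub repository):

    import jax.numpy as jnp
    import numpyro
    import numpyro.distributions as dist
    from jax.scipy.stats import norm

    def curve_fit_model(scaling, subj_idx, img_idx, n_subj, n_img,
                        responses=None):
        # population-level parameters, on natural log scales; the prior center
        # shown here (0.25) is for the energy model, and widths are guesses
        log_sc = numpyro.sample('log_s_c', dist.Normal(jnp.log(0.25), 1.0))
        log_a = numpyro.sample('log_alpha', dist.Normal(jnp.log(5.0), 1.0))
        # independent subject- and image-level effects for both parameters,
        # plus a per-subject lapse rate
        with numpyro.plate('subject', n_subj):
            subj_sc = numpyro.sample('subj_sc', dist.Normal(0.0, 0.1))
            subj_a = numpyro.sample('subj_a', dist.Normal(0.0, 0.1))
            lapse = numpyro.sample('lapse', dist.Beta(2.0, 50.0))
        with numpyro.plate('image', n_img):
            img_sc = numpyro.sample('img_sc', dist.Normal(0.0, 0.1))
            img_a = numpyro.sample('img_a', dist.Normal(0.0, 0.1))
        s_c = jnp.exp(log_sc + subj_sc[subj_idx] + img_sc[img_idx])
        alpha = jnp.exp(log_a + subj_a[subj_idx] + img_a[img_idx])
        # psychophysical curve from the equations above, with lapses
        d = alpha * jnp.sqrt(jnp.clip(1 - (s_c / scaling) ** 2, 0.0))
        p = (norm.cdf(d / jnp.sqrt(2.0)) * norm.cdf(d / 2)
             + norm.cdf(-d / jnp.sqrt(2.0)) * norm.cdf(-d / 2))
        p = (1 - lapse[subj_idx]) * p + 0.5 * lapse[subj_idx]
        numpyro.sample('obs', dist.Bernoulli(probs=p), obs=responses)

Posterior samples would then be drawn by wrapping this model in numpyro.infer.MCMC with the numpyro.infer.NUTS kernel, as described above.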

The posterior distribution represents the model's beliefs about the parameters given the priors and data, and is summarized throughout this paper by the posterior mean and 95% highest density interval. The latter is the narrowest range of values containing 95% of the distribution, as opposed to the more common 95% confidence interval, which is arranged symmetrically around the mean. The two are identical for symmetric distributions but can diverge markedly if the distribution is highly skewed (Kruschke, 2015).

Software

These experiments relied on a variety of custom scripts written in Python 3.7.10 (Van Rossum and Drake, 2009), all found in the GitHub repository associated with this paper. The following packages were used: snakemake (Mölder et al., 2021), JAX (Bradbury et al., 2018), matplotlib (Hunter, 2007), psychopy (Peirce et al., 2019), scipy (Virtanen et al., 2020), scikit-image (van der Walt et al., 2014), pytorch (Paszke et al., 2019), arviz (Kumar et al., 2019), numpyro (Phan et al., 2019; Bingham et al., 2018), pandas (Reback et al., 2021; McKinney, 2010), seaborn (Waskom, 2021), jupyterlab (Kluyver et al., 2016), and xarray (Hoyer and Hamman, 2017).

Acknowledgements

The authors would like to thank David Brainard for the use of his photographs, both from the published UPenn Natural Image Database (Tkačik et al., 2011) and the unpublished set of images from around Philadelphia. They would also like to thank Tony Movshon, David Heeger, David Brainard, Corey Ziemba, and Colin Bredenberg for their feedback on the manuscript, Mike Landy for his assistance with the design of the psychophysical task and feedback on the manuscript, Heiko Schütt for his assistance with the Markov Chain Monte Carlo analysis, and the authors of Wallis et al. (2019) for sharing their code and data. Furthermore, they would like to thank Liz Lovero, Paul Murray, Dylan Simon, and Aaron Watters for their work in creating the metamer browser website.