Schematic diagram of perceptual metamers. Each panel shows a two-dimensional depiction of the set of all possible images: every image corresponds to a point in this space and every point in this space represents an image. Perceptual metamers — images that cannot be reliably distinguished — are equivalence classes in this space and we illustrate this by partitioning the space into distinct perceptually identical regions. Left: example image (black point), and surrounding metameric images (region enclosed by black polygon). Center: In a hierarchical visual system, in which each stage transforms the signals of the previous stage and discards some information, metamers of earlier stages are nested within metamers of later stages. That is, every pair of images that are metamers for an early visual area 𝒩1 are also metamers for the later visual area 𝒩2. The converse does not hold: there are images that 𝒩1 can distinguish but that 𝒩2 cannot. Right: Two particular image families: Samples of white noise (distribution represented as a grayscale intensity map in the lower right corner) and the set (manifold) of natural images (distribution represented by the curved gray line). Typical white noise samples fall within a single perceptual metamer class (humans are unable to distinguish them). Natural images, on the other hand, are generally distinguishable from each other, as well as from un-natural images (those that lie off the manifold).

Left: Pooling models are parameterized by their scaling value, s, and the statistics they pool, θ. Like any system that discards information, including the human visual system, these models have metamers: sets of images that are perceived as identical, represented graphically as enclosed non-overlapping regions. We draw samples from these sets using an optimization procedure: starting with an initial image I_i, we adjust the pixel values until their pooled statistics match those of the target (original) image T. The synthesized image depends on the target image, the metamer model, the initial point, and the stochastic synthesis algorithm. For a given set of statistics θ, increasing the scaling value increases the size of the metamer set in a nested manner: any image that is a metamer for s,θ is also a metamer for αs,θ, for any factor α > 1. Right: Changing the set of pooled statistics from θ to ϕ results in different sets of model metamers, which may or may not overlap with those of the original model (though both must include the target image, T). If the model’s metamer classes differ substantially in shape from the perceptual metamer classes, they will not provide a good description of the perceptual metamers at critical scaling (e.g., the blue ellipse contains only a small portion of the surrounding perceptual metamer region).
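The synthesis procedure described above can be sketched as gradient descent on the mismatch between pooled statistics. A minimal sketch, assuming a linear pooling model (stats = W @ image) as a stand-in for the paper's luminance and energy models; all names here are illustrative, not the paper's actual implementation.

```python
import numpy as np

def synthesize_metamer(target, W, init, n_steps=2000, lr=0.05):
    """Sketch of metamer synthesis: adjust the pixels of `init` by
    gradient descent until the pooled statistics W @ x match those of
    the target image. A linear model is assumed for illustration; the
    paper's luminance/energy models would take its place."""
    t_stats = W @ target
    x = init.copy()
    for _ in range(n_steps):
        err = W @ x - t_stats        # mismatch in pooled statistics
        x -= lr * (W.T @ err)        # gradient step on 0.5 * ||err||^2
    return x

# Two different noise initializations give two distinct images whose
# pooled statistics both match the target: a pair of model metamers.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16)) / 4   # 4 pooled statistics, 16 "pixels"
target = rng.random(16)
m1 = synthesize_metamer(target, W, rng.random(16))
m2 = synthesize_metamer(target, W, rng.random(16))
```

Because the model discards information (4 statistics from 16 pixels), the two syntheses retain different components in the model's null space, so they match the target's statistics while remaining distinct images.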

Two pooling models. Both models compute local image statistics weighted by a Gaussian that is separable in a log-polar coordinate system, such that radial extent is approximately twice the angular extent (half-maximum levels indicated by red contours). Windows are laid out in a log-polar grid, with peaks separated by one standard deviation. A single scaling factor governs the size of all pooling windows. The luminance model (top) computes average luminance, approximating the spatial pooling performed by retinal ganglion cells. The spectral energy model (bottom) computes average spectral energy at 4 orientations and 6 scales, as well as luminance, for a total of 25 statistics per window, approximating the representation of complex cells in primary visual cortex (V1). Spectral energy is computed using the complex steerable pyramid constructed in the Fourier domain (Simoncelli and Freeman, 1995), squaring and summing across the real and imaginary components. A full resolution version of this figure can be found on the OSF.
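The two computations in this caption can be sketched directly: `spectral_energy` squares and sums the real and imaginary parts of complex pyramid coefficients, and `log_polar_window` is an illustrative single pooling window, Gaussian and separable in log-eccentricity and polar angle. The mapping from the scaling factor to the two Gaussian widths is an assumption here, chosen only to reflect the 2:1 radial-to-angular aspect ratio described above.

```python
import numpy as np

def spectral_energy(coeffs):
    """Energy of complex steerable-pyramid coefficients: squared real
    part plus squared imaginary part (the squared magnitude)."""
    return coeffs.real ** 2 + coeffs.imag ** 2

def log_polar_window(ecc, ang, ecc0, ang0, scaling):
    """Illustrative pooling window centered at (ecc0, ang0): a Gaussian
    separable in log-eccentricity and polar angle, with radial extent
    twice the angular extent. The relation between the scaling factor
    and the standard deviations is an assumption."""
    sigma_r = scaling          # radial std (in log-eccentricity)
    sigma_a = scaling / 2      # angular std: half the radial extent
    dr = np.log(ecc) - np.log(ecc0)
    da = np.angle(np.exp(1j * (ang - ang0)))   # wrap angle difference
    return np.exp(-0.5 * ((dr / sigma_r) ** 2 + (da / sigma_a) ** 2))
```

The log-eccentricity coordinate is what makes window size grow linearly with eccentricity under a single scaling factor.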

Example synthesized model metamers. Top: Target image. Middle: Luminance model metamers, computed for two different scaling values (values as indicated, red ellipses to right of fixation indicate pooling window contours at half-max at that eccentricity). The left image is computed with a small scaling value, and is a perceptual metamer for most subjects: when fixating at the cross in the center of the image, the two images appear perceptually identical to the target image. Note, however, that when fixating in the periphery (e.g., the blue box), one can clearly see that the image differs from the target (see enlarged versions of the foveal and peripheral neighborhoods to right). The right image is computed with a larger scaling value, and is no longer a perceptual metamer (for any choice of observer fixation). Bottom: Energy model metamers. Again, the left image is computed with a small scaling value and is a perceptual metamer for most observers when fixated on the center cross. Peripheral content (e.g., blue box) contains more complex distortions, readily visible when viewed directly. The right image, computed with a large scaling value, differs in appearance from the target regardless of observer fixation. Full resolution version of this figure can be found on the OSF.

Target images used in the experiments. Images contain a variety of content, including textures, objects, and scenes. All are relatively high-resolution RAW camera images, with values proportional to luminance and quantized to 16 bits. Images were converted to grayscale, cropped to 2048 x 2600 pixels, displayed at 53.6 x 42.2 degrees, with intensity values rescaled to lie within the range [0.05, 0.95] of the display intensities. All subjects saw target images 1-10, half saw 11-15, and half saw 16-20. A full resolution version of this figure can be found on the OSF.
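The intensity rescaling described here can be sketched as follows; min/max normalization is assumed for illustration (the caption does not specify how the rescaling is anchored).

```python
import numpy as np

def rescale_intensities(img, lo=0.05, hi=0.95):
    """Map a linear-luminance grayscale image into [lo, hi] of the
    display range. Min/max normalization is an assumption."""
    x = img.astype(np.float64)
    x = (x - x.min()) / (x.max() - x.min())   # normalize to [0, 1]
    return lo + x * (hi - lo)                 # then map into [lo, hi]

# Usage on a hypothetical 16-bit image:
img16 = np.array([[0, 32768], [49152, 65535]], dtype=np.uint16)
out = rescale_intensities(img16)
```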

Performance and psychophysical curve parameter values for different models and image comparisons. The luminance model has a substantially smaller critical scaling than the energy model, and original vs. synthesized comparisons yield smaller critical scaling values than synthesized vs. synthesized comparisons. (A) Psychometric functions, expressing probability correct as a function of scaling parameter, for both energy and luminance models (aqua and beige, respectively), and original vs. synthesized (solid line) and synthesized vs. synthesized (dashed line) comparisons. Data points represent average values across subjects and images, 4320 trials per data point, except for the luminance model synthesized vs. synthesized comparison, which has only 180 trials per data point (one subject, five images). Lines represent the posterior predictive means of fitted curves across subjects and images, with the shaded region indicating the 95% high-density interval (HDI; Kruschke, 2015). Horizontal bars (below dashed line at 0.5) indicate the range of physiological scaling values for the associated retinal ganglion cell type or cortical area. (B) Estimated parameter values, separated by image (left) or subject (right). Top row shows the critical scaling value and bottom row the value of the maximum d′ parameter. Points represent the posterior means, shaded regions the 95% HDI, and horizontal dashed lines and shaded regions the global means and 95% HDI. Note that the luminance model synthesized vs. synthesized comparison is not shown, because the data are poorly fit (panel A, beige dashed line).
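The fitted curves map scaling to probability correct through a discriminability function d′(s) governed by the two parameters above. The particular form below, zero at or below a critical scaling s_crit and saturating at max d′ above it, is an assumption loosely following Freeman and Simoncelli (2011) and Wallis et al. (2019), not necessarily the paper's exact parameterization.

```python
import math

def d_prime(s, s_crit, d_max):
    """Illustrative discriminability curve: zero at or below the
    critical scaling, rising toward d_max above it. The exact
    functional form is an assumption."""
    if s <= s_crit:
        return 0.0
    return d_max * math.sqrt(1.0 - (s_crit / s) ** 2)

def p_correct(d):
    """Probability correct for an unbiased observer in a
    two-alternative task: Phi(d / sqrt(2))."""
    return 0.5 * (1.0 + math.erf(d / 2.0))  # Phi(d / sqrt(2)) via erf
```

Under this parameterization, performance sits at chance (0.5) below the critical scaling and approaches the ceiling set by max d′ as scaling grows.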

Comparison of two synthesized metamers is more difficult than comparison of a synthesized metamer with the original image. For the highest tested scaling value (1.5) the original vs. synthesized comparison is trivial while the synthesized vs. synthesized comparison is difficult (energy model) or impossible (luminance model). Top: target image. Middle: Two luminance model metamers, generated from different initial uniform noise images. Bottom: Two energy model metamers, generated from different initial uniform noise images. All four of the model metamers can be easily distinguished from the natural image at top (original vs. synthesized), but are difficult to distinguish from each other, despite the fact that their pooling windows have grown very large (synthesized vs. synthesized). Full resolution version of this figure can be found on the OSF.

The interaction between image content and model sensitivities greatly affects asymptotic performance, most noticeably on the synthesized vs. synthesized comparison for the energy model, while critical scaling does not vary as much. (A) Performance for each image, averaged across subjects, comparing synthesized images to natural images. Most images show similar performance, with one obvious outlier whose performance never rises above 60%. Data points represent the average across subjects, 288 trials per data point for half the images, 144 per data point for the other half. Lines represent the posterior predictive means across subjects, with the shaded region giving the 95% HDI. (B) Example model metamers for two extreme images. The top row (nyc) is the image with the best performance (purple line in panel A), while the bottom row (llama) has the worst performance (red line in panel A). In each row, the leftmost image is the target image, and the next two show model metamers with the lowest and highest tested scaling values for this comparison. Performance on the llama image is poor because much of the image content resembles pink noise. Thus, even with larger scaling values, the model metamers are very difficult to distinguish from the target image. The nyc image, on the other hand, contains hard edges with precise alignment of phase across scales. As the energy model discards phase information, this phase structure is lost in the model metamers, which are consequently easy to distinguish from the target image at all tested scaling values. However, this pattern does not hold in the luminance model, or for synthesized vs. synthesized comparisons, for which both images exhibit typical performance (see appendix figure 6). A full resolution version of this figure can be found on the OSF.

Initializing model metamers with natural images does not affect performance in the original vs. synthesized comparison, but reduces critical scaling and increases max d′ for the synthesized vs. synthesized comparison. Note that all data in this figure are for a single subject and 15 of the 20 target images. (A) Probability correct for one subject (sub-00), as a function of scaling. Each point represents the average of 540 trials (over all fifteen images), except for the synthesized vs. synthesized luminance model white noise comparison (averaged over 5 images). Vertical black line indicates a single scaling value at which difficulty ranges from chance to 100%, depending on initialization and comparison, as discussed in panel B. (B) Three comparisons corresponding to the three psychophysical curves intersected by the vertical black line in panel A. See text for details. A full resolution version of this figure can be found on the OSF.

Parameter values for the comparisons shown in figure 9A (Top: critical scaling value; Bottom: max d′). Data shown are from the single subject who completed all comparisons. Points represent the posterior means, shaded regions the 95% HDI, and horizontal dashed lines and shaded regions the averages across all shown images for this subject. Note that the luminance model, synthesized vs. synthesized: white noise comparison is not shown in this panel, because the data were poorly fit by this curve.

Critical scaling (posterior mean over all subjects and images) and number of statistics (as a percentage of the number of input image pixels, a type of “compression rate”), for each model and comparison. Note that the critical scaling values for original vs. synthesized comparisons do not correspond to the same number of statistics across models; in particular, at their critical scaling values, all models have dimensionality less than that of the input image.
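The “compression rate” here is a simple ratio of pooled statistics to input pixels. A sketch of the arithmetic; the window count below is hypothetical, chosen only to show the computation.

```python
def compression_rate(n_windows, stats_per_window, n_pixels):
    """Number of pooled statistics as a percentage of input pixels."""
    return 100.0 * n_windows * stats_per_window / n_pixels

# Hypothetical example: 2000 pooling windows of the energy model
# (25 statistics per window) on a 2048 x 2600 pixel image.
rate = compression_rate(2000, 25, 2048 * 2600)   # about 0.94%
```

The number of windows itself depends on the scaling value, since larger windows tile the visual field more coarsely, which is why models at their critical scaling need not have the same number of statistics.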

Critical scaling values for the two pooling models presented in this paper (Luminance and Energy) and the Texture model (originally tested in Freeman and Simoncelli (2011), data from Wallis et al. (2019), averaging across the two image classes). Solid points indicate the original vs. synthesized white noise comparisons, while hollow points indicate synthesized vs. synthesized white noise comparisons (for the luminance model, this is effectively infinite, since participants were unable to complete the task). For all three models, critical values are smaller in the original vs. synthesized comparison than the synthesized vs. synthesized one, and their ratio decreases with increasing complexity of image statistics. A potential explanation for this is that the more complex models approximate computations performed in deeper levels of the visual hierarchy, beyond which there are fewer remaining stages to discard information.

In the idealized version of the metamer paradigm, synthesized images broadly sample the space of possible model metamers. Thus, at large scaling values, synthesized images are distinguishable from each other and from the target image, and they become indistinguishable from each other at the same scaling value at which they become indistinguishable from the target (left panel). In this case, model metamers are metameric with each other if and only if they are metameric with the target image. In our experiments, however, this is not the case: at some scaling values, two synthesized images can be metameric with each other, but distinguishable from the target image (right panel).

The difference between the synth vs. synth and original vs. synth white noise comparisons decreases as model complexity increases (compare with figure 12). Because of the alignment between the luminance model metamer classes, the underlying perceptual metamer classes, and the white noise samples used to initialize the synthesis process, synthesized images always lie within the same perceptual metamer class and are thus never distinguishable. On the other hand, the texture model’s metamer classes grow orthogonally to the white noise samples, resulting in a much smaller critical scaling value for the synth vs. synth white noise comparison.

Schematics describing results presented in this paper. Unlike the idealized metamer paradigm (see section Why does critical scaling depend on the comparison being performed? and figure 12A), the critical scaling value for our models depends on the comparison being performed and, to a lesser extent, the image used to initialize synthesis. For the original vs. synthesized comparison, this does not affect the critical scaling value, as the synthesized images are always distinct from the target. However, initializing with white noise can result in synthesized images that lie in the same metamer class as each other even while they are distinct from the target, resulting in a relatively large scaling value for the synthesized vs. synthesized comparison. Initializing with natural images reduces the magnitude of this phenomenon.

Schematic of psychophysics task. Top shows the structure of a single trial: a single image is presented for 200 msec, bisected in the center by a gray bar, followed by a blank screen for 500 msec. The image is then re-displayed for 200 msec, with a random half of the image changed to the comparison image. The participants then have as long as needed to report which half of the image changed, followed by a 500 msec intertrial interval. Bottom table shows the possible comparisons. In original vs. synthesized, one image was the target image whose model representation the synthesized images match (see figure 5), and the other was one such synthesized image. In synthesized vs. synthesized, both were synthesized images targeting the same original image, with the same model scaling, but different initializations. In the experiments, the dividing bar, blanks, and backgrounds were all midgray. For more details, see text.

Relationship between the max d′ parameter, α, and asymptotic performance. As max d′ increases beyond approximately 5 (where asymptotic performance is at ceiling), the slope of the psychophysical curve continues to increase (for example, compare the slopes of the luminance and energy model original vs. synth white noise comparisons in figure 9A).