Schematic diagram of perceptual metamers. Each panel shows a two-dimensional depiction of the set of all possible images: every image corresponds to a point in this space and every point in this space represents an image. Perceptual metamers — images that cannot be reliably distinguished — are equivalence classes in this space and we illustrate this by partitioning the space into distinct perceptually identical regions. Left: example image (black point), and surrounding metameric images (region enclosed by black polygon). Center: In a hierarchical visual system, in which each stage transforms the signals of the previous stage and discards some information, metamers of earlier stages are nested within metamers of later stages. That is, every pair of images that are metamers for an early visual area 𝒩1 are also metamers for the later visual area 𝒩2. The converse does not hold: there are images that 𝒩1 can distinguish but that 𝒩2 cannot. Right: Two particular image families: Samples of white noise (distribution represented as a grayscale intensity map in the lower right corner) and the set (manifold) of natural images (distribution represented by the curved gray line). Typical white noise samples fall within a single perceptual metamer class (humans are unable to distinguish them). Natural images, on the other hand, are generally distinguishable from each other, as well as from un-natural images (those that lie off the manifold).

Left: Pooling models are parameterized by their scaling value, s, and the statistics they pool, θ. Like any system that discards information, including the human visual system, these models have metamers: sets of images that are perceived as identical, represented graphically as enclosed non-overlapping regions. We draw samples from these sets using an optimization procedure: starting with an initial image I_i, we adjust the pixel values until their pooled statistics match those of the target (original) image T. The synthesized image depends on the target image, the metamer model, the initial point, and the stochastic synthesis algorithm. For a given set of statistics θ, increasing the scaling value increases the size of the metamer set in a nested manner: any image that is a metamer for s,θ is also a metamer for αs,θ, for any factor α > 1. Right: Changing the set of pooled statistics from θ to ϕ results in different sets of model metamers, which may or may not overlap with those of the original model (though both must include the target image, T). If the model’s metamer classes differ substantially in shape from the perceptual metamer classes, they will not provide a good description of the perceptual metamers at critical scaling (e.g., the blue ellipse contains only a small portion of the surrounding perceptual metamer region).
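The synthesis procedure described above can be sketched as gradient descent on the mismatch between pooled statistics. A minimal sketch, assuming a linear pooling model (stats = W @ image) as a stand-in for the paper's luminance and energy models; all names here are illustrative, not the paper's actual implementation.

```python
import numpy as np

def synthesize_metamer(target, W, init, n_steps=2000, lr=0.05):
    """Sketch of metamer synthesis: adjust the pixels of `init` by
    gradient descent until the pooled statistics W @ x match those of
    the target image. A linear model is assumed for illustration; the
    paper's luminance/energy models would take its place."""
    t_stats = W @ target
    x = init.copy()
    for _ in range(n_steps):
        err = W @ x - t_stats        # mismatch in pooled statistics
        x -= lr * (W.T @ err)        # gradient step on 0.5 * ||err||^2
    return x

# Two different noise initializations give two distinct images whose
# pooled statistics both match the target: a pair of model metamers.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16)) / 4   # 4 pooled statistics, 16 "pixels"
target = rng.random(16)
m1 = synthesize_metamer(target, W, rng.random(16))
m2 = synthesize_metamer(target, W, rng.random(16))
```

Because the model discards information (4 statistics from 16 pixels), the two syntheses retain different components in the model's null space, so they match the target's statistics while remaining distinct images.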

Two pooling models. Both models compute local image statistics weighted by a Gaussian that is separable in a log-polar coordinate system, such that radial extent is approximately twice the angular extent (half-maximum levels indicated by red contours). Windows are laid out in a log-polar grid, with peaks separated by one standard deviation. A single scaling factor governs the size of all pooling windows. The luminance model (top) computes average luminance, approximating the spatial pooling performed by retinal ganglion cells. The spectral energy model (bottom) computes average spectral energy at 4 orientations and 6 scales, as well as luminance, for a total of 25 statistics per window, approximating the representation of complex cells in primary visual cortex (V1). Spectral energy is computed using the complex steerable pyramid constructed in the Fourier domain (Simoncelli and Freeman, 1995), squaring and summing across the real and imaginary components. A full resolution version of this figure can be found on the OSF.
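The two computations in this caption can be sketched directly: `spectral_energy` squares and sums the real and imaginary parts of complex pyramid coefficients, and `log_polar_window` is an illustrative single pooling window, Gaussian and separable in log-eccentricity and polar angle. The mapping from the scaling factor to the two Gaussian widths is an assumption here, chosen only to reflect the 2:1 radial-to-angular aspect ratio described above.

```python
import numpy as np

def spectral_energy(coeffs):
    """Energy of complex steerable-pyramid coefficients: squared real
    part plus squared imaginary part (the squared magnitude)."""
    return coeffs.real ** 2 + coeffs.imag ** 2

def log_polar_window(ecc, ang, ecc0, ang0, scaling):
    """Illustrative pooling window centered at (ecc0, ang0): a Gaussian
    separable in log-eccentricity and polar angle, with radial extent
    twice the angular extent. The relation between the scaling factor
    and the standard deviations is an assumption."""
    sigma_r = scaling          # radial std (in log-eccentricity)
    sigma_a = scaling / 2      # angular std: half the radial extent
    dr = np.log(ecc) - np.log(ecc0)
    da = np.angle(np.exp(1j * (ang - ang0)))   # wrap angle difference
    return np.exp(-0.5 * ((dr / sigma_r) ** 2 + (da / sigma_a) ** 2))
```

The log-eccentricity coordinate is what makes window size grow linearly with eccentricity under a single scaling factor.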

Example synthesized model metamers. Top: Target image. Middle: Luminance model metamers, computed for two different scaling values (values as indicated, red ellipses to right of fixation indicate pooling window contours at half-max at that eccentricity). The left image is computed with a small scaling value, and is a perceptual metamer for most subjects: when fixating at the cross in the center of the image, the two images appear perceptually identical to the target image. Note, however, that when fixating in the periphery (e.g., the blue box), one can clearly see that the image differs from the target (see enlarged versions of the foveal and peripheral neighborhoods to right). The right image is computed with a larger scaling value, and is no longer a perceptual metamer (for any choice of observer fixation). Bottom: Energy model metamers. Again, the left image is computed with a small scaling value and is a perceptual metamer for most observers when fixated on the center cross. Peripheral content (e.g., blue box) contains more complex distortions, readily visible when viewed directly. The right image, computed with a large scaling value, differs in appearance from the target regardless of observer fixation. Full resolution version of this figure can be found on the OSF.

Target images used in the experiments. Images contain a variety of content, including textures, objects, and scenes. All are relatively high-resolution RAW camera images, with values proportional to luminance and quantized to 16 bits. Images were converted to grayscale, cropped to 2048 x 2600 pixels, displayed at 53.6 x 42.2 degrees, with intensity values rescaled to lie within the range [0.05, 0.95] of the display intensities. All subjects saw target images 1-10, half saw 11-15, and half saw 16-20. A full resolution version of this figure can be found on the OSF.
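The intensity rescaling described here can be sketched as follows; min/max normalization is assumed for illustration (the caption does not specify how the rescaling is anchored).

```python
import numpy as np

def rescale_intensities(img, lo=0.05, hi=0.95):
    """Map a linear-luminance grayscale image into [lo, hi] of the
    display range. Min/max normalization is an assumption."""
    x = img.astype(np.float64)
    x = (x - x.min()) / (x.max() - x.min())   # normalize to [0, 1]
    return lo + x * (hi - lo)                 # then map into [lo, hi]

# Usage on a hypothetical 16-bit image:
img16 = np.array([[0, 32768], [49152, 65535]], dtype=np.uint16)
out = rescale_intensities(img16)
```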

Performance and psychophysical curve parameter values for different models and image comparisons. The luminance model has a substantially smaller critical scaling than the energy model, and original vs. synthesized comparisons yield smaller critical scaling values than synthesized vs. synthesized comparisons. (A) Psychometric functions, expressing probability correct as a function of scaling parameter, for both energy and luminance models (aqua and beige, respectively), and original vs. synthesized (solid line) and synthesized vs. synthesized (dashed line) comparisons. Data points represent average values across subjects and images, 4320 trials per data point, except for the luminance model synthesized vs. synthesized comparison, which has only 180 trials per data point (one subject, five images). Lines represent the posterior predictive means of fitted curves across subjects and images, with the shaded region indicating the 95% high-density interval (HDI; Kruschke, 2015). Horizontal bars (below dashed line at 0.5) indicate the range of physiological scaling values for the associated retinal ganglion cell type or cortical area. (B) Estimated parameter values, separated by image (left) or subject (right). Top row shows the critical scaling value and bottom row the value of the maximum d′ parameter. Points represent the posterior means, shaded regions the 95% HDI, and horizontal dashed lines and shaded regions the global means and 95% HDI. Note that the luminance model synthesized vs. synthesized comparison is not shown, because the data are poorly fit (panel A, beige dashed line).
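The fitted curves map scaling to probability correct through a discriminability function d′(s) governed by the two parameters above. The particular form below, zero at or below a critical scaling s_crit and saturating at max d′ above it, is an assumption loosely following Freeman and Simoncelli (2011) and Wallis et al. (2019), not necessarily the paper's exact parameterization.

```python
import math

def d_prime(s, s_crit, d_max):
    """Illustrative discriminability curve: zero at or below the
    critical scaling, rising toward d_max above it. The exact
    functional form is an assumption."""
    if s <= s_crit:
        return 0.0
    return d_max * math.sqrt(1.0 - (s_crit / s) ** 2)

def p_correct(d):
    """Probability correct for an unbiased observer in a
    two-alternative task: Phi(d / sqrt(2))."""
    return 0.5 * (1.0 + math.erf(d / 2.0))  # Phi(d / sqrt(2)) via erf
```

Under this parameterization, performance sits at chance (0.5) below the critical scaling and approaches the ceiling set by max d′ as scaling grows.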

Comparison of two synthesized metamers is more difficult than comparison of a synthesized metamer with the original image. For the highest tested scaling value (1.5) the original vs. synthesized comparison is trivial while the synthesized vs. synthesized comparison is difficult (energy model) or impossible (luminance model). Top: target image. Middle: Two luminance model metamers, generated from different initial uniform noise images. Bottom: Two energy model metamers, generated from different initial uniform noise images. All four of the model metamers can be easily distinguished from the natural image at top (original vs. synthesized), but are difficult to distinguish from each other, despite the fact that their pooling windows have grown very large (synthesized vs. synthesized). Full resolution version of this figure can be found on the OSF.

The interaction between image content and model sensitivities greatly affects asymptotic performance, most noticeably on the synthesized vs. synthesized comparison for the energy model, while critical scaling does not vary as much. (A) Performance for each image, averaged across subjects, comparing synthesized images to natural images. Most images show similar performance, with one obvious outlier whose performance never rises above 60%. Data points represent the average across subjects, 288 trials per data point for half the images, 144 per data point for the other half. Lines represent the posterior predictive means across subjects, with the shaded region giving the 95% HDI. (B) Example model metamers for two extreme images. The top row (nyc) is the image with the best performance (purple line in panel A), while the bottom row (llama) has the worst performance (red line in panel A). In each row, the leftmost image is the target image, and the next two show model metamers with the lowest and highest tested scaling values for this comparison. Performance on the llama image is poor because much of the image content resembles pink noise. Thus, even with larger scaling values, the model metamers are very difficult to distinguish from the target image. The nyc image, on the other hand, contains hard edges with precise alignment of phase across scales. As the energy model discards phase information, this phase structure is lost in the model metamers, which are consequently easy to distinguish from the target image at all tested scaling values. However, this pattern does not hold in the luminance model, or for synthesized vs. synthesized comparisons, for which both images exhibit typical performance (see appendix figure 6). A full resolution version of this figure can be found on the OSF.

Initializing model metamers with natural images does not affect performance in the original vs. synthesized comparison, but reduces critical scaling and increases max d′ for the synthesized vs. synthesized comparison. Note that all data in this figure are for a single subject and 15 of the 20 target images. (A) Probability correct for one subject (sub-00), as a function of scaling. Each point represents the average of 540 trials (over all fifteen images), except for the synthesized vs. synthesized luminance model white noise comparison (averaged over 5 images). Vertical black line indicates a single scaling value at which difficulty ranges from chance to 100%, depending on initialization and comparison, as discussed in panel B. (B) Three comparisons corresponding to the three psychophysical curves intersected by the vertical black line in panel A. See text for details. A full resolution version of this figure can be found on the OSF.

Parameter values for the comparisons shown in figure 9A (Top: critical scaling value; Bottom: max d′). Data shown are from the single subject who completed all comparisons. Points represent the posterior means, shaded regions the 95% HDI, and horizontal dashed lines and shaded regions the averages across all shown images for this subject. Note that the luminance model, synthesized vs. synthesized: white noise comparison is not shown in this panel, because the data were poorly fit by this curve.

Critical scaling (posterior mean over all subjects and images) and number of statistics (as a percentage of the number of input image pixels, a type of “compression rate”), for each model and comparison. Note that the critical scaling values for original vs. synthesized comparisons do not correspond to the same number of statistics across models; in particular, at their critical scaling values, all models have dimensionality less than that of the input image.
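The “compression rate” here is a simple ratio of pooled statistics to input pixels. A sketch of the arithmetic; the window count below is hypothetical, chosen only to show the computation.

```python
def compression_rate(n_windows, stats_per_window, n_pixels):
    """Number of pooled statistics as a percentage of input pixels."""
    return 100.0 * n_windows * stats_per_window / n_pixels

# Hypothetical example: 2000 pooling windows of the energy model
# (25 statistics per window) on a 2048 x 2600 pixel image.
rate = compression_rate(2000, 25, 2048 * 2600)   # about 0.94%
```

The number of windows itself depends on the scaling value, since larger windows tile the visual field more coarsely, which is why models at their critical scaling need not have the same number of statistics.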

Critical scaling values for the two pooling models presented in this paper (Luminance and Energy) and the Texture model (originally tested in Freeman and Simoncelli (2011), data from Wallis et al. (2019), averaging across the two image classes). Solid points indicate the original vs. synthesized white noise comparisons, while hollow points indicate synthesized vs. synthesized white noise comparisons (for the luminance model, this is effectively infinite, since participants were unable to complete the task). For all three models, critical values are smaller in the original vs. synthesized comparison than the synthesized vs. synthesized one, and their ratio decreases with increasing complexity of image statistics. A potential explanation for this is that the more complex models approximate computations performed in deeper levels of the visual hierarchy, beyond which there are fewer remaining stages to discard information.

In the idealized version of the metamer paradigm, synthesized images broadly sample the space of possible model metamers. Thus, at large scaling values, synthesized images are distinguishable from each other and from the target image, and they become indistinguishable from each other at the same scaling value at which they become indistinguishable from the target (left panel). In this case, model metamers are metameric with each other if and only if they are metameric with the target image. In our experiments, however, this is not the case: at some scaling values, two synthesized images can be metameric with each other, but distinguishable from the target image (right panel).

The difference between the synth vs. synth and original vs. synth white noise comparisons decreases as model complexity increases (compare with figure 12). Because of the alignment between the luminance model metamer classes, the underlying perceptual metamer classes, and the white noise samples used to initialize the synthesis process, synthesized images always lie within the same perceptual metamer class and are thus never distinguishable. On the other hand, the texture model’s metamer classes grow orthogonally to the white noise samples, resulting in a much smaller critical scaling value for the synth vs. synth white noise comparison.

Schematics describing results presented in this paper. Unlike the idealized metamer paradigm (see section Why does critical scaling depend on the comparison being performed? and figure 12A), the critical scaling value for our models depends on the comparison being performed and, to a lesser extent, the image used to initialize synthesis. For the original vs. synthesized comparison, this does not affect the critical scaling value, as the synthesized images are always distinct from the target. However, initializing with white noise can result in synthesized images that lie in the same metamer class as each other even while they are distinct from the target, resulting in a relatively large scaling value for the synthesized vs. synthesized comparison. Initializing with natural images reduces the magnitude of this phenomenon.

Schematic of psychophysics task. Top shows the structure of a single trial: a single image is presented for 200 msec, bisected in the center by a gray bar, followed by a blank screen for 500 msec. The image is then re-displayed for 200 msec, with a random half of the image changed to the comparison image. The participants then have as long as needed to report which half of the image changed, followed by a 500 msec intertrial interval. Bottom table shows the possible comparisons. In original vs. synthesized, one image was the target image whose model representation the synthesized images match (see figure 5), and the other was one such synthesized image. In synthesized vs. synthesized, both were synthesized images targeting the same original image, with the same model scaling, but different initializations. In the experiments, the dividing bar, blanks, and backgrounds were all midgray. For more details, see text.

Relationship between the max d′ parameter, α, and asymptotic performance. As max d′ increases beyond approximately 5 (where asymptotic performance is at ceiling), the slope of the psychophysical curve continues to increase (for example, compare the slopes of the luminance and energy model original vs. synth white noise comparisons in figure 9A).