
The lawful imprecision of human surface tilt estimation in natural scenes

Seha Kim (corresponding author), Johannes Burge (corresponding author)
University of Pennsylvania, United States
Research Article
Cite this article as: eLife 2018;7:e31448 doi: 10.7554/eLife.31448

Abstract

Estimating local surface orientation (slant and tilt) is fundamental to recovering the three-dimensional structure of the environment. It is unknown how well humans perform this task in natural scenes. Here, with a database of natural stereo-images having groundtruth surface orientation at each pixel, we find dramatic differences in human tilt estimation with natural and artificial stimuli. Estimates are precise and unbiased with artificial stimuli and imprecise and strongly biased with natural stimuli. An image-computable Bayes optimal model grounded in natural scene statistics predicts human bias, precision, and trial-by-trial errors without fitting parameters to the human data. The similarities between human and model performance suggest that the complex human performance patterns with natural stimuli are lawful, and that human visual systems have internalized local image and scene statistics to optimally infer the three-dimensional structure of the environment. These results generalize our understanding of vision from the lab to the real world.

https://doi.org/10.7554/eLife.31448.001

eLife digest

The ability to assess how tilted a surface is, or its ‘surface orientation’, is critical for interacting productively with the environment. For example, it helps organisms to determine whether a particular surface is better suited for walking or climbing. Humans and other animals estimate 3-dimensional (3D) surface orientations from 2-dimensional (2D) images on their retinas. But exactly how they calculate the tilt of a surface from the retinal images is not well understood.

Scientists have studied how humans estimate surface orientation by showing them smooth (often planar) surfaces with artificial markings. These studies suggested that humans very accurately estimate the direction in which a surface is tilted. But whether humans are as good at estimating surface tilt in the real world, where scenes are more complex than those tested in experiments, is unknown.

Now, Kim and Burge show that human tilt estimation in natural scenes is often inaccurate and imprecise. To better understand humans’ successes and failures in estimating tilt, Kim and Burge developed an optimal computational model, grounded in natural scene statistics, that estimates tilt from natural images. Kim and Burge found that the model accurately predicted how humans estimate tilt in natural scenes. This suggests that the imprecise human estimates are not the result of a poorly designed visual system. Rather, humans, like the computational model, make the best possible use of the information images provide to perform an estimation task that is very difficult in natural scenes.

The study takes an important step towards generalizing our understanding of human perception from the lab to the real world.

https://doi.org/10.7554/eLife.31448.002

Introduction

Understanding how vision works in natural conditions is a primary goal of vision research. One measure of success is the degree to which performance in a fundamental visual task can be predicted directly from image data. Estimating the 3D structure of the environment from 2D retinal images is just such a task. However, relatively little is known about how the human visual system estimates 3D surface orientation from images of natural scenes.

3D surface orientation is typically parameterized by slant and tilt. Slant is the amount by which a surface is rotated away from an observer; tilt is the direction in which the surface is rotated (Figure 1A). Compared to slant, tilt has received little attention, even though both are critically important for successful interaction with the 3D environment. For example, even if slant has been accurately estimated, humans must estimate tilt to determine where they can walk. Surfaces with tilts of 90°, like the ground plane, can sometimes be walked on. Surfaces with tilts of 0° or 180°, like the sides of tree trunks, can never be walked on.

Tilt and slant, natural scene database, and tilt prior.

(A) Tilt is the direction of slant. Slant is the amount of rotation out of the reference (e.g., frontoparallel) plane. (B) Example stereo-image pair (top) and corresponding stereo-range data (bottom). The gauge figure indicates local surface orientation. To see the scene in stereo 3D, free-fuse the left-eye (LE) and right-eye (RE) images. (C) Prior distribution of unsigned tilt in natural scenes, computed from 600 million groundtruth tilt samples in the natural scene database (see Materials and methods). Cardinal surface tilts associated with the ground plane (90°) and tree trunks (0° and 180°) occur far more frequently than oblique tilts in natural scenes. Unsigned tilt, τ ∈ [0°, 180°), indicates 3D surface orientation up to a sign ambiguity (i.e., tilt modulo 180°).

https://doi.org/10.7554/eLife.31448.003

Numerous psychophysical, computational, and neurophysiological studies have probed the human ability to estimate surface slant, surface tilt, and 3D shape. Systematic performance has been observed, and models have been developed that nicely describe performance. Most previous studies have used stimuli having planar (Stevens, 1983; Knill, 1998a, 1998b; Hillis et al., 2004; Burge et al., 2010a; Rosenholtz and Malik, 1997; Rosenberg et al., 2013; Murphy et al., 2013; Velisavljević and Elder, 2006; Saunders and Knill, 2001; Welchman et al., 2005; Sanada et al., 2012; Tsutsui et al., 2001) or smoothly curved (Todd et al., 1996; Fleming et al., 2011; Todd, 2004; Marlow et al., 2015; Li and Zaidi, 2000, 2004; Norman et al., 2006) surface shapes and regular (Knill, 1998a, 1998b; Hillis et al., 2004; Watt et al., 2005; Rosenholtz and Malik, 1997; Rosenberg et al., 2013; Murphy et al., 2013; Velisavljević and Elder, 2006; Li and Zaidi, 2000, 2004; Welchman et al., 2005) or random-patterned (Burge et al., 2010a; Fleming et al., 2011) surface markings. These stimuli are not representative of the variety of surface shapes and markings encountered in natural viewing. Surfaces in natural scenes often have complex surface geometries and are marked by complicated surface textures. Thus, performance with simple artificial scenes may not be representative of performance in natural scenes. Also, models developed with artificial scenes often generalize poorly (or cannot even be applied) to natural scenes. These issues concern not just studies of 3D surface orientation perception but vision and visual neuroscience at large.

Few studies have examined the human ability to estimate 3D surface orientation using natural photographic images, the stimuli that our visual systems evolved to process. None, to our knowledge, have done so with high-resolution groundtruth surface orientation information. There are good reasons for this gap in the literature. Natural images are complex and difficult to characterize mathematically, and groundtruth data about natural scenes are notoriously difficult to collect. Research with natural stimuli has often been criticized (justifiably) on the grounds that natural stimuli are too complicated or too poorly controlled to allow strong conclusions to be drawn from the results. The challenge, then, is to develop experimental methods and computational models that can be used with natural stimuli without sacrificing rigor and interpretability.

Here, we report an extensive examination of human 3D tilt estimation from local image information with natural stimuli. We sampled thousands of natural image patches from a recently collected stereo-image database of natural scenes with precisely co-registered distance data (Figure 1B) (Burge et al., 2016). Groundtruth surface orientation was computed directly from the distance data (see Materials and methods). Human observers binocularly viewed the natural patches and estimated the tilt at the center of each patch. The same human observers also viewed artificially-textured planar stimuli matched to the groundtruth tilt, slant, distance, and luminance contrast of the natural stimuli. First, we compared human performance with natural and matched artificial stimuli. Then, we compared human performance to the predictions of an image-computable normative model, a Bayes’ optimal observer, that makes the best possible use of the available image information for the task. This experimental design enables direct, meaningful comparison of human performance across stimulus types, allowing the isolation of important stimulus differences and the interpretation of human response patterns with respect to principled predictions provided by the model.

A rich set of results emerges. First, tilt estimation in natural scenes is hard; compared to performance with artificial stimuli, performance with natural stimuli is poor. Second, with natural stimuli, human tilt estimates cluster at the cardinal tilts (0°, 90°, 180° and 270°), echoing the prior distribution of tilts in natural scenes (Figure 1C) (Burge et al., 2016; Yang and Purves, 2003a; Yang and Purves, 2003b; Adams et al., 2016). Third, human estimates tend to be more biased and variable when the groundtruth tilts are oblique (e.g., 45°). Fourth, at each groundtruth tilt, the distributions of human and model errors tend to be very similar, even though the error distributions themselves are highly irregular. Fifth, human and model observer trial-by-trial errors are correlated, suggesting that similar (or strongly correlated) stimulus properties drive both human and ideal performance. Together, these results represent an important step towards the goal of being able to predict human percepts of 3D structure directly from photographic images in a fundamental natural task.

Results

Human observers binocularly viewed thousands of randomly sampled patches of natural scenes; they viewed an equal number of stimuli at each of 24 tilt bins between 0° and 360°. The stimuli were presented on a large (2.0 × 1.2 m) stereo front-projection system positioned 3 m from the observer. This relatively long viewing distance minimizes focus cues to flatness. Except for focus cues, the display system recreates the retinal images that would have been formed by the original scene. Each scene was viewed binocularly through a small virtual aperture (1° or 3° of visual angle) positioned 5 arcmin of disparity in front of the sampled point in the scene (Figure 2A); the viewing situation is akin to looking at the world through a straw (McDermott, 2004). Patches were displayed at the random image locations from which they were sampled. Observers reported, using a mouse-controlled probe, the estimated surface tilt at the center of each patch (Figure 2B). We pooled data across human observers and aperture sizes and converted the tilt estimates to unsigned tilt for analysis (signed tilt modulo 180°) because the estimation of unsigned tilt was similar for all observers and aperture sizes (Figure 2—figure supplement 1, Figure 2—figure supplement 2). The same observers also estimated surface tilt with an extensive set of artificial planar stimuli that were matched to the tilts, slants, distances, and luminance contrasts of the natural stimuli presented in the experiment. (Each planar artificial stimulus had one of three texture types: 1/f noise, 3.5 cpd plaid, and 5.25 cpd plaid; Figure 2—figure supplement 3.) Thus, any observed performance differences between natural and artificial stimuli cannot be attributed to these dimensions.
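The conversion from signed to unsigned tilt described above is a modulo operation. A minimal sketch in Python, assuming tilts are expressed in degrees (the function name is hypothetical, not taken from the article):

```python
def to_unsigned_tilt(signed_tilt_deg: float) -> float:
    """Map a signed tilt in degrees onto the unsigned range [0, 180)."""
    # Python's % returns a non-negative result for a positive modulus,
    # so negative inputs are handled as well.
    return signed_tilt_deg % 180.0


# A 270-degree signed tilt and a 90-degree signed tilt share the same unsigned tilt.
print(to_unsigned_tilt(270.0), to_unsigned_tilt(90.0))
```

Under this mapping the four cardinal signed tilts (0°, 90°, 180°, 270°) collapse onto the two unsigned cardinal tilts (0° and 90°).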

Figure 2.
Experimental stimuli and human tilt responses.

(A) The virtual viewing situation. (B) Example natural stimulus (ground plane) and artificial stimulus (3.5 cpd plaid). See Figure 2—figure supplement 3 for all three types. The task was to report the tilt at the center of the small (1° diameter) circle. Aperture sizes were either 3° (shown) or 1° (not shown) of visual angle. Observers set the orientation of the probe (circle and line segments) to indicate estimated tilt. Free-fuse to see in stereo 3D. (C) Raw responses for every trial in the experiment. (D) Histogram of raw responses (unsigned estimates). The dashed horizontal line shows the uniform distribution of groundtruth tilts presented in the experiment. (Histograms of signed tilt estimates are shown in Figure 2—figure supplement 4.) (E) Estimate means and (F) estimate variances as a function of groundtruth tilt. Human tilt estimates are more biased and variable with natural stimuli (top) than with artificial stimuli (bottom). Data are combined across all three artificial texture types; see Figure 2—figure supplement 3 for performance with each individual texture type. With artificial stimuli, human estimates are unbiased and estimate variance is low. Model observer predictions (minimum mean squared error [MMSE] estimates; black curves) parallel human performance with natural stimuli.

https://doi.org/10.7554/eLife.31448.004

Natural and artificial stimuli elicited strikingly different patterns of performance (Figure 2C). Although many stimuli of both types elicit tilt estimates τ^ that approximately match the groundtruth tilt (data points on the unity line), a substantial number of natural stimuli elicit estimates that cluster at the cardinal tilts (data points at τ^=0,90,180,270). No such clustering occurs with artificial stimuli. The histogram of the human tilt estimates explicitly shows the clustering, or lack thereof (Figure 2D). With natural stimuli, the distribution of unsigned estimates p(τ^) peaks at 0° and 90° and has a similar shape to the prior distribution of groundtruth tilts in the natural scene database (Figure 1C; also see Figure 2—figure supplement 4). If the database is representative of natural scenes, then one might expect the human visual system to use the natural statistics of tilt as a tilt prior in the perceptual processes that convert stimulus measurements into estimates. Standard Bayesian estimation theory predicts that the prior will influence estimates more when measurements are unreliable and will influence estimates less when measurements are reliable (Knill and Richards, 1996).
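The reliability-dependent influence of the prior can be illustrated with the textbook case of a Gaussian prior and a Gaussian measurement. This is only an analogy under simplifying assumptions: tilt is a circular variable and the natural tilt prior is multimodal, so the expressions below are not the model used in this article. The symbols m (measurement), σ_m² (measurement noise variance), μ₀ and σ₀² (prior mean and variance) are introduced here for illustration.

```latex
% Posterior mean for a Gaussian prior N(\mu_0, \sigma_0^2) and a
% measurement m corrupted by Gaussian noise of variance \sigma_m^2:
\hat{x} = w\,m + (1 - w)\,\mu_0,
\qquad
w = \frac{\sigma_0^2}{\sigma_0^2 + \sigma_m^2}

% Reliable measurement:   \sigma_m^2 \to 0      \Rightarrow w \to 1, \ \hat{x} \to m
% Unreliable measurement: \sigma_m^2 \to \infty \Rightarrow w \to 0, \ \hat{x} \to \mu_0
```

In this simplified setting, an unreliable measurement pulls the estimate toward the prior mean, which is the qualitative pattern that the clustering at cardinal tilts suggests.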

We summarized 3D tilt estimation performance by computing the mean and variance of the tilt estimates τ^ as a function of groundtruth tilt (Figure 2E,F). (The mean and variance were computed using circular statistics because tilt is an angular variable; see Materials and methods.) These summary statistics change systematically with groundtruth tilt, exhibiting patterns reminiscent of the 2D oblique effect (Appelle, 1972; Furmanski and Engel, 2000; Girshick et al., 2011). With natural stimuli, estimates are maximally biased at oblique tilts and unbiased at cardinal tilts; estimate variance is highest at oblique tilts (~60° and ~120°) and lowest at cardinal tilts. With artificial stimuli, estimates are essentially unbiased and are less variable across tilt. The unbiased responses to artificial stimuli imply that the biased responses to natural stimuli accurately reflect biased perceptual estimates, under the assumption that the function that maps perceptual estimates to probe responses is stable across stimulus types (see Materials and methods). (See Figure 2—figure supplement 3 for performance with each individual artificial texture type.) The summary statistics reveal clear differences between the stimulus types. However, there is more to the data than the summary statistics can reveal. Thus, we analyzed the raw data more closely.
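Because unsigned tilt repeats every 180°, an arithmetic mean can be misleading (the arithmetic mean of 170° and 10° is 90°, although both tilts are near 0°). One common way to compute circular statistics for such data is to rescale the angles so that one period spans the full circle and then average unit vectors. The sketch below assumes that approach and defines variance as one minus the mean resultant length; the article's Materials and methods give the authoritative definitions, and the function name is hypothetical.

```python
import cmath
import math


def circular_mean_var(tilts_deg, period=180.0):
    """Return (mean in [0, period), variance in [0, 1]) for periodic angles."""
    # Rescale so that one period maps onto the full circle (angle doubling
    # when period = 180 degrees).
    scale = 2.0 * math.pi / period
    resultant = sum(cmath.exp(1j * scale * t) for t in tilts_deg) / len(tilts_deg)
    mean = (cmath.phase(resultant) / scale) % period
    variance = 1.0 - abs(resultant)  # assumed definition: 1 - resultant length
    return mean, variance


mean, var = circular_mean_var([170.0, 10.0])
print(mean, var)  # mean lies at (or numerically next to) 0 modulo 180
```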

The probabilistic relationship between groundtruth tilt τ and human tilt estimates τ^ is shown in Figures 3 and 4. Each subplot in Figure 3A shows the distribution of estimation errors p(τ^ − τ | τ) for a different groundtruth tilt. With artificial stimuli, estimation errors e = τ^ − τ are unimodally distributed and peaked at zero (black symbols). With natural stimuli, estimation errors are more irregularly distributed, and the peak locations change systematically with groundtruth tilt (white points). With cardinal groundtruth tilts (e.g., τ=0 or τ=90), the error distributions peak at zero and large errors are rare. With oblique groundtruth tilts (e.g., τ=60 or τ=120), the error distributions tend to be bi-modal with two prominent peaks at non-zero errors. For example, when groundtruth tilt τ=60, the most common errors were −60° and 30°. These errors occurred because observers incorrectly estimated the tilt to be 0° or 90°, respectively, when the correct answer was 60°. Thus, at this groundtruth tilt, the human observers frequently (and incorrectly) estimated cardinal tilts instead of the correct oblique tilt.
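The conditional error distributions can be tabulated by wrapping each error into a 180°-wide range and counting errors separately for each groundtruth tilt. The sketch below is illustrative: the wrapping range [−90°, 90°) and the 15° error bins are assumptions, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def tilt_error(estimate_deg, truth_deg, period=180.0):
    """Signed error (estimate minus truth) wrapped into [-period/2, period/2)."""
    half = period / 2.0
    return (estimate_deg - truth_deg + half) % period - half


def conditional_error_counts(truths_deg, estimates_deg, bin_width=15.0):
    """Count binned errors separately for each groundtruth tilt."""
    counts = defaultdict(Counter)
    for truth, estimate in zip(truths_deg, estimates_deg):
        err = tilt_error(estimate, truth)
        err_bin = round(err / bin_width) * bin_width  # nearest bin center
        counts[truth][err_bin] += 1
    return counts


# The example from the text: at a groundtruth tilt of 60 degrees, estimates of
# 0 and 90 degrees correspond to errors of -60 and +30 degrees.
print(tilt_error(0.0, 60.0), tilt_error(90.0, 60.0))  # -60.0 30.0
```

Normalizing each Counter by its total gives an empirical estimate of the error distribution at that groundtruth tilt.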

Distribution of tilt estimation errors for different groundtruth tilts.

(A) Conditional error distributions p(τ^ − τ | τ) are obtained by binning estimates for each groundtruth tilt (vertical bins in Figure 3B) and subtracting the groundtruth tilt. With artificial stimuli, the error distributions are centered on 0° (black symbols). With natural stimuli, the error distributions change systematically with groundtruth tilt (white symbols). For cardinal groundtruth tilts (0° and 90°), the most common error is zero. For oblique tilts, the error distributions peak at values other than zero (e.g., arrows in τ=60 and τ=120 subplots). The irregular error distributions are nicely predicted by the MMSE estimator (black curve); shaded regions show 95% confidence intervals on the MMSE estimates from 1000 Monte Carlo simulations of the experiment (see Materials and methods). The MMSE estimator predicts human performance even though zero free parameters were fit to the human responses. (B) Raw unsigned tilt estimates with natural stimuli (same data as Figure 2C, but shown in the unsigned tilt domain). The rectangular box shows estimates in the τ=60 tilt bin.

https://doi.org/10.7554/eLife.31448.010
Figure 4.
Distribution of groundtruth tilts for different tilt estimates.

(A) Conditional distributions of groundtruth tilt p(τ | τ^) are obtained by binning groundtruth tilts for each estimated tilt (horizontal bins in Figure 4B). Unlike the conditional error distributions, these distributions are similar with natural and artificial stimuli. The most probable groundtruth tilt, conditional on the estimate, peaks at the estimated tilt for both stimulus types. Thus, any given estimate is a good indicator of the groundtruth tilt despite the overall poorer performance with natural stimuli. Also, these conditional distributions are well accounted for by the MMSE estimates; shaded regions show 95% confidence intervals on the MMSE estimates from 1000 Monte Carlo simulations of the experiment (see Materials and methods). The MMSE model had zero free parameters to fit to human performance. (B) Raw unsigned tilt estimates (same data as Figure 2C, but shown in the unsigned tilt domain). The box shows groundtruth tilts in the τ=60 estimated tilt bin.

https://doi.org/10.7554/eLife.31448.011

Tilt estimates from natural stimuli are less accurate at oblique than at cardinal groundtruth tilts. Does this fact imply that oblique tilt estimates (e.g., τ^=60) provide less accurate information about groundtruth tilt than cardinal tilt estimates (e.g., τ^=90)? No. Each panel in Figure 4A shows the distribution of groundtruth tilts p(τ|τ^) for each estimated tilt. The most probable groundtruth tilt equals the estimated tilt, and the variance of each distribution is approximately constant, regardless of whether the estimated tilt is cardinal or oblique. Furthermore, the estimates from natural and artificial stimuli provide nearly equivalent information about groundtruth (see also Figure 4—figure supplement 1). Thus, even though tilt estimation performance is far poorer at oblique than at cardinal tilts and is far poorer with natural than with artificial stimuli, all tilt estimates regardless of the value are similarly good predictors of groundtruth tilt.

How can it be that low-accuracy estimates from natural stimuli predict groundtruth nearly as well as high-accuracy estimates from artificial stimuli? Some regions of natural scenes yield high-reliability measurements that make tilt estimation easy; other regions of natural scenes yield low-reliability measurements that make tilt estimation hard. When measurements are reliable, the prior influences estimates less; when measurements are unreliable, the prior influences estimates more. Thus, cardinal tilt estimates can result either from reliable measurements of cardinal tilts or from unreliable measurements of oblique tilts. On the other hand, oblique tilt estimates can only result from reliable measurements of oblique tilts, because the measurements must be reliable enough to overcome the influence of the prior. All these factors combine to make each tilt estimate, regardless of its value, an equally reliable predictor of groundtruth tilt. The uniformly reliable information provided by the estimates about groundtruth (see Figure 4A) may simplify the computational processes that optimally pool local estimates into global estimates (see Discussion). The generality of this phenomenon across natural tasks remains to be determined. However, we speculate that it may have widespread importance for understanding perception in natural scenes, as well as in other circumstances where measurement reliability varies drastically across spatial location.

Normative model

We asked whether the complicated pattern of human performance with natural stimuli is consistent with optimal information processing. To answer this question, we compared human performance to the performance of a normative model, a Bayes optimal observer that optimizes 3D tilt estimation in natural scenes given a squared error cost function (Burge et al., 2016). The model takes three local image cues C as input — luminance, texture, and disparity gradients — and returns the minimum mean squared error (MMSE) tilt estimate τ^MMSE as output. (The MMSE estimate is the mean of the posterior probability distribution over groundtruth tilt given the measured image cues.)
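The parenthetical statement that the MMSE estimate equals the posterior mean follows from a standard decomposition of expected squared error. The derivation below treats τ as an ordinary (linear) variable for clarity; for a circular variable such as tilt, the corresponding mean would be computed with circular statistics. The notation \hat{\tau} corresponds to τ^ in the text.

```latex
% Expected squared error of a candidate estimate \hat{\tau}, given cues C:
\mathbb{E}\!\left[(\hat{\tau} - \tau)^2 \mid C\right]
  = \left(\hat{\tau} - \mathbb{E}[\tau \mid C]\right)^2
  + \operatorname{Var}[\tau \mid C]

% The variance term does not depend on \hat{\tau}, so the error is minimized by
\hat{\tau}_{\mathrm{MMSE}} = \mathbb{E}[\tau \mid C]
```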

To determine the optimal estimate for each possible triplet of cue values, we use the natural scene database. At each pixel in the database, the image cues are computed directly from the photographic images within a local area, and the groundtruth tilt is computed directly from the distance data (see Materials and methods; [Burge et al., 2016]). In other words, the model is ‘image-computable’: the model computes the image cues from image pixels and produces tilt estimates as outputs.

We approximate the posterior mean E[τ|C] = Σ_τ τ p(τ|C) by computing the sample mean of the groundtruth tilt conditional on each unique image cue triplet (Figure 5A). The result is a table, or ‘estimate cube,’ where each cell stores the optimal estimate τ^MMSE = E[τ|C] for a particular combination of image cues (Figure 5B).
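The table-based approximation can be sketched as a lookup keyed by quantized cue triplets. The code below is a toy illustration and not the authors' implementation. It assumes that each cue is quantized into 64 bins (consistent with the 64³ cells reported for the estimate cube), that cue values lie in a hypothetical range [0, 180), and that the conditional mean is a circular mean over unsigned tilt. All function names are hypothetical.

```python
import cmath
import math
from collections import defaultdict


def quantize(cue, lo=0.0, hi=180.0, n_bins=64):
    """Map a cue value in [lo, hi] to an integer bin index in [0, n_bins - 1]."""
    frac = (cue - lo) / (hi - lo)
    return min(n_bins - 1, max(0, int(frac * n_bins)))


def build_estimate_cube(samples):
    """samples: iterable of ((c1, c2, c3), groundtruth_tilt_deg) pairs."""
    sums = defaultdict(complex)
    for cues, tilt in samples:
        key = tuple(quantize(c) for c in cues)
        # Angle doubling: unsigned tilt has a 180-degree period.
        sums[key] += cmath.exp(1j * math.radians(2.0 * tilt))
    # Circular sample mean of groundtruth tilt for each observed cue triplet.
    return {k: (math.degrees(cmath.phase(v)) / 2.0) % 180.0 for k, v in sums.items()}


def mmse_estimate(cube, cues):
    """Look up the stored estimate; None if the triplet was never observed."""
    return cube.get(tuple(quantize(c) for c in cues))


cube = build_estimate_cube([((10, 10, 10), 80.0), ((10, 10, 10), 100.0)])
print(mmse_estimate(cube, (11, 11, 11)))  # same bin as (10, 10, 10): about 90
```

Because estimation reduces to a table lookup, no assumption about the form of the joint distribution of cues and tilt is needed, at the cost of requiring enough samples to populate the table.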

Figure 5.
Normative model for tilt estimation in natural scenes.

(A) The model observer estimate is the minimum mean squared error (MMSE) tilt estimate τ^MMSE given three image cue measurements. Optimal estimates are approximated from 600 million data points (90 stereo-images) in the natural scene database: image cue values are computed directly from the photographic images and groundtruth tilts are computed directly from the distance data. (B) MMSE estimates for ~260,000 (64³) unique image cue triplets are stored in an ‘estimate cube.’ (C) Model observer estimates for the 3600 unique natural stimuli used in the experiment. For each stimulus used in the experiment, the image cues are computed, and the MMSE estimate is looked up in the ‘estimate cube.’ Excluding the 3600 experimental stimuli from the 600 million stimuli that determined the estimate cube has no impact on predictions. The optimal estimates within the estimate cube change smoothly with the image cue values; hence, a relatively small number of samples can explore the structure of the full 3D space and provide representative performance measures (see Discussion). (D) Proportion variance explained (R²) by the normative model for the summary statistics (estimate counts, means, and variances; Figure 2D–F) and the conditional distributions (Figures 3 and 4). All R² values are highly significant (p < 10⁻⁶).

https://doi.org/10.7554/eLife.31448.013

In the cue-combination literature, cues are commonly assumed to be statistically independent (Ernst and Banks, 2002). In natural scenes, it is not clear whether this assumption holds. Fortunately, the normative model used here is free of assumptions about statistical independence and the form of the joint probability distribution (see Discussion). Thus, our normative model provides a principled benchmark, grounded in natural scene statistics, against which to compare human performance.

We tested the model observer on the exact same set of natural stimuli used to test human observers (Figure 5C). The model observer predicts the overall pattern of raw human responses (see also Figure 5—figure supplement 1). More impressively, the model observer predicts the counts, means, and variances of the human tilt estimates (Figure 2D–F), the conditional error distributions (Figure 3), and the conditional groundtruth tilt distributions (Figure 4). The model explains a large proportion of the variance for all of these performance measures (Figure 5D). These results indicate that the human visual system estimates tilt in accordance with optimal processes that minimize error in natural scenes. We conclude that the biased and imprecise human tilt estimates with natural stimuli are nevertheless lawful.

Two points are worth emphasizing. First, this model observer had no free parameters that were fit to the human data (Burge et al., 2016); instead, the model observer was designed to perform the task optimally given the three image cues. Second, the close agreement between human and model performance suggests that humans use the same cues (or cues that strongly correlate with those) used by the normative model (see Discussion).

Trial-by-trial error

If human and model observers use the same cues in natural stimuli to estimate tilt, variation in the stimuli should cause similar variation in performance. Are human performance and model observer performance similar in individual trials? The same set of natural stimuli was presented to all observers. Thus, it is possible to make direct, trial-by-trial comparisons of the estimation errors that each observer made. If the properties of individual natural stimuli influence estimates similarly across observers, then observer errors across trials should be correlated. Accounting for trial-by-trial errors is one of the most stringent comparisons that can be made between model and human performance.

Natural stimuli do elicit similar trial-by-trial errors from human and model observers (Figure 6A). The model predicts trial-by-trial human errors far better than chance. We quantify the model-human similarity with the circular correlation coefficients of the trial-by-trial model and human estimates (Figure 6B). The correlation coefficients are significant. This result implies that the errors are systematically and reliably dependent on the properties of natural stimuli and that these properties affect human and model observers similarly.
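One widely used definition of a circular correlation coefficient is the Jammalamadaka–SenGupta statistic, which correlates the sines of deviations from each variable's circular mean. The sketch below assumes that definition and assumes errors with a 180° period; the article's Materials and methods should be consulted for the statistic actually used. The function names are hypothetical.

```python
import math


def _circ_mean(angles_rad):
    """Circular mean direction of angles given in radians."""
    return math.atan2(sum(math.sin(a) for a in angles_rad),
                      sum(math.cos(a) for a in angles_rad))


def circular_correlation(a_deg, b_deg, period=180.0):
    """Jammalamadaka-SenGupta circular correlation for paired periodic data."""
    scale = 2.0 * math.pi / period  # map one period onto the full circle
    a = [x * scale for x in a_deg]
    b = [x * scale for x in b_deg]
    mean_a, mean_b = _circ_mean(a), _circ_mean(b)
    sin_a = [math.sin(x - mean_a) for x in a]
    sin_b = [math.sin(x - mean_b) for x in b]
    numerator = sum(p * q for p, q in zip(sin_a, sin_b))
    denominator = math.sqrt(sum(p * p for p in sin_a) * sum(q * q for q in sin_b))
    return numerator / denominator


# Toy errors (degrees) for a model and a human observer on the same five trials.
model_errors = [-60.0, -20.0, 0.0, 15.0, 30.0]
human_errors = [-50.0, -10.0, 5.0, 10.0, 35.0]
print(circular_correlation(model_errors, human_errors))
```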

Figure 6.
Trial-by-trial estimation errors: normative model vs. human observers.

The diagonal structure in the plots indicates that trial-by-trial errors are correlated. (A) Raw trial-by-trial errors with natural stimuli between model and human observers. (B) Correlation coefficients (circular) for trial-by-trial errors between model and each human observer. The error bars represent 95% confidence intervals from 1000 bootstrapped samples of the correlation coefficient. The dashed line shows the mean of the correlation coefficients of errors between human observers in natural stimuli (Figure 6—figure supplement 1). (C) Bias-corrected errors in natural stimuli. (D) Correlation coefficient for bias-corrected errors.

https://doi.org/10.7554/eLife.31448.015

However, because both human and model observers produced biased estimates with natural stimuli (Figure 2E, Figure 2—figure supplement 2), it is possible that the biases are responsible for the error correlations. To remove the possible influence of bias, we computed the bias-corrected error. On each trial, we subtracted the observer bias at the corresponding groundtruth tilt from the raw error: e = (τ^ − τ) − E[τ^ − τ | τ], where the first term is the raw error and the second term is the bias. Human and model bias-corrected errors are also significantly correlated (Figure 6C,D). The human-human correlation (dashed line in Figure 6B,D; see Figure 6—figure supplement 1) sets an upper bound for the model-human correlation. The model-human correlation approaches this bound in some cases. Other measures of trial-by-trial similarity (e.g., choice probability; Figure 6—figure supplement 2C) yield similar conclusions. These results show that natural stimulus variation at a given groundtruth tilt causes similar response variation in human observers and the model observer.
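The bias correction can be sketched as a two-step computation: estimate the mean error at each groundtruth tilt, then subtract it from every raw error at that tilt. The code below assumes unsigned tilt (180° period), errors wrapped into [−90°, 90°), and a circular mean for the bias; these choices and the function names are illustrative assumptions rather than the article's implementation.

```python
import cmath
import math
from collections import defaultdict

PERIOD = 180.0  # assumed: unsigned tilt repeats every 180 degrees


def wrap(err_deg):
    """Wrap an angular difference into [-90, 90)."""
    half = PERIOD / 2.0
    return (err_deg + half) % PERIOD - half


def circular_mean(values_deg):
    """Circular mean of 180-degree-periodic values, in (-90, 90]."""
    scale = 2.0 * math.pi / PERIOD
    resultant = sum(cmath.exp(1j * scale * v) for v in values_deg)
    return cmath.phase(resultant) / scale


def bias_corrected_errors(truths_deg, estimates_deg):
    """Raw error minus the mean error (bias) at the same groundtruth tilt."""
    errors = [wrap(e - t) for t, e in zip(truths_deg, estimates_deg)]
    by_truth = defaultdict(list)
    for t, err in zip(truths_deg, errors):
        by_truth[t].append(err)
    bias = {t: circular_mean(errs) for t, errs in by_truth.items()}
    return [wrap(err - bias[t]) for t, err in zip(truths_deg, errors)]


# Raw errors at tilt 60 are +10 and +30 (bias +20); at tilt 0 they are 0 and +10 (bias +5).
print(bias_corrected_errors([60.0, 60.0, 0.0, 0.0], [70.0, 90.0, 0.0, 10.0]))
```

After correction, the errors at each groundtruth tilt average to zero, so any remaining correlation between observers reflects stimulus-by-stimulus variation rather than shared bias.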

To ensure that the predictive power of the model observer is not trivial, we developed multiple alternative models. All other models predict human performance more poorly (Figure 6—figure supplement 2). Our results do not rule out the possibility that another model could predict human performance better, but the current MMSE estimator establishes a strong benchmark against which other models must be compared.

Thus, the normative model, without fitting to the human data, accounts for human tilt estimates at the level of the summary statistics (Figure 2D–F), the conditional distributions (Figure 3 and Figure 4), and the trial-by-trial errors (Figure 6). Together, this evidence suggests that the human visual system’s perceptual processes and the normative model’s computations are making similar use of similar information. We conclude that the human visual system makes near-optimal use of the available information in natural stimuli for estimating 3D surface tilt.

Performance-impacting stimulus factors: Slant, distance, and natural depth variation

In our experiment, natural and artificial stimuli were matched on many dimensions: tilt, slant, distance, and luminance contrast. These stimulus factors are commonly controlled in perceptual experiments. Consistent with previous reports, slant and distance had a substantial impact on estimation error (Watt et al., 2005) with both natural and artificial stimuli (Figure 7). (Luminance contrast had little impact on performance.)

The effect of slant and distance on tilt estimation error in natural stimuli for human and model observers.

(A) Absolute error decreases linearly with slant. Estimation error decreases approximately 20° as slant changes from 30° to 60°. (B) Absolute error increases linearly with distance. Estimation error increases approximately 15° as distance increases from 3 m to 30 m.

https://doi.org/10.7554/eLife.31448.018

Even after controlling for these stimulus dimensions, tilt estimation with natural stimuli is considerably poorer than tilt estimation with artificial stimuli. Other factors must therefore account for the differences. What are they? In our experiment, each artificial scene consisted of a single planar surface. Natural scenes contain natural depth variation (i.e., complex surface structure); some surfaces are approximately planar, some are curved or bumpy. How are differences in surface planarity related to differences in performance with natural and artificial scenes? To quantify the departure of surface structure from planarity, we defined local tilt variance as the circular variance of the groundtruth tilt values in the central 1° area of each stimulus. Then, we examined how estimation error changes with tilt variance.

First, we found that estimation error increases linearly with tilt variance for both human and model observers (Figure 8A). Unfortunately, tilt variance co-varies with groundtruth tilt (cardinal tilts tend to have lower tilt variance than oblique tilts, presumably because of the ground plane; Figure 8B), which means that the effect of groundtruth tilt could be misattributed to tilt variance. Hence, we repeated the analysis of overall error separately for cardinal tilts alone and for oblique tilts alone. We found that the effect of tilt variance is independent of groundtruth tilt (Figure 8C). Thus, like slant and distance, tilt variance (i.e., departure from surface planarity) is one of several key stimulus factors that impact tilt estimation performance.

Figure 8 with 1 supplement
The effect of tilt variance on tilt estimation error.

(A) Absolute error increases linearly with tilt variance. Estimation error increases approximately 25° across the range of tilt variance. Artificial stimuli were perfectly planar and had zero local depth variation; hence the individual data point at zero tilt variance. Solid curve shows the model prediction. (B) Tilt variance co-varies with groundtruth tilt. Oblique tilts tend to be associated with less planar (i.e., more bumpy) regions of natural scenes. (Tilt variance was computed in 15° wide bins.) (C) Same as (A) but conditional on whether groundtruth tilts are cardinal (red, 0° ± 22.5° or 90° ± 22.5°) or oblique (blue, 45° ± 22.5° or 135° ± 22.5°, shaded areas in [B]). Data points are spaced unevenly because they are grouped in quantile bins, such that each data point represents an equal number of stimuli. The solid curves represent the errors of the MMSE estimator for cardinal (red) and oblique (blue) groundtruth tilts. The normative model predicts performance in all cases.

https://doi.org/10.7554/eLife.31448.019

Second, we found that for near-planar natural stimuli, the average estimation errors with natural and artificial stimuli are closely matched (left-most points in Figure 8A). Does this result mean that tilt variance accounts for all performance differences between natural and artificial stimuli? No. Performance with near-planar natural stimuli is still substantially different from performance with artificial stimuli (Figure 8—figure supplement 1). In addition, individual human and model trial-by-trial estimation errors are still correlated for the near-planar natural stimuli. Furthermore, the patterns of human performance with natural stimuli are robust across a wide range of tilt variance. Figure 9 shows the summary statistics of human observers (estimate counts, means, and variances; cf. Figure 2D–F) for multiple different tilt variances. Model performance is similarly robust to tilt variance (Figure 9—figure supplement 1).

Figure 9 with 1 supplement
Robustness of performance measures to tilt variance.

Human tilt estimation performance with natural stimuli for five tilt variance quintiles (colors). The quintile centers are at 0.12, 0.33, 0.55, 0.76, and 0.97, respectively. (A) Estimate count ratio (i.e., the ratio of estimated to presented tilt) at each tilt. With near-planar natural stimuli, cardinal tilts are still estimated much more frequently than with planar artificial stimuli. (B) Estimate means. (C) The variance of estimates. Except for stimuli with the highest tilt variance, the patterns of mean and variance with natural stimuli hold across tilt variances.

https://doi.org/10.7554/eLife.31448.021

We conclude that although tilt variance is an important performance-modulating factor, it is not the only factor responsible for performance differences with natural and artificial stimuli. Other factors must be responsible. Understanding these other factors is an important direction for future work.

Discussion

Estimating 3D surface orientation requires the estimation of both slant and tilt. The current study focuses on tilt estimation. We quantify performance in natural scenes and report that human tilt estimates are often neither accurate nor precise. To connect our work to the classic literature, we matched artificial and natural stimuli on the stimulus dimensions that are controlled most often in typical experiments. The comparison revealed systematic performance differences. The detailed patterns of human performance are predicted, without free parameters to fit the data, by a normative model that is grounded in natural scene statistics and that makes the best possible use of the available image information. Importantly, this model is distinguished from many models of mid-level visual tasks because it is ‘image computable’; that is, it takes image pixels as input and produces tilt estimates as output. Together, the current experiment and modeling effort contributes to a broad goal in vision and visual neuroscience research: to generalize our understanding of human vision from the lab to the real world.

Generality of conclusions and future directions

Influence of full-field viewing

The main experiment examined tilt estimation performance for small patches of 3D natural scenes (1° and 3° of visual angle). Does tilt estimation performance improve substantially with full-field viewing of the 3D natural scenes? We re-ran the experiment with full-field viewing (36° x 21°; see Figure 1B for an example full-field scene). We found that human performance is essentially the same (Figure 2—figure supplement 5). Although it may seem surprising that full-field viewing does not substantially improve performance, it makes sense. Scene structure is correlated only over a local area. Except for the ground plane, it is unusual for surfaces to have constant orientation over large visual angles. Thus, scene locations far from the target add little information about local tilt.

Influence of scale

Groundtruth surface orientation is computed from a locally planar approximation to the surface structure, but surfaces in natural scenes are generally non-planar. Hence, the area over which groundtruth tilt is computed can affect the values assigned to each surface location. The same is true of the local image cue values. We checked how sensitive our results are to the scale of the local analysis area. We recomputed groundtruth tilt for two scales and recomputed image cue values for four scales (see Materials and methods). All eight combinations of scales yield the same qualitative pattern of results.

Influence of gaze angle

The statistics of local surface orientation change with elevation in natural scenes (Adams et al., 2016; Yang and Purves, 2003b). In our study, scene statistics were computed from range scans and stereo-images (36° x 21° field-of-view) that were captured from human eye height with earth parallel gaze (Burge et al., 2016). Different results may characterize other viewing situations, a possibility that could be evaluated in future work. However, the vast majority of eye movements in natural scenes are smaller than 10° (Land and Hayhoe, 2001; Pelz and Rothkopf, 2007; Dorr et al., 2010). Hence, the results presented here are likely to be representative of an important subset of conditions that occur in natural viewing.

Influence of internal noise

We examined how well the normative model (i.e., MMSE estimator) predicts human performance with artificial stimuli. The model nicely predicts the unbiased pattern of human estimate means. However, the model predicts estimate variances that are lower than the human estimate variances that we observed (although the predicted and observed patterns are consistent). We do not yet understand the reason for this discrepancy. One possibility is that the normative model used here does not explicitly model how internal noise affects human performance. In natural scenes, natural stimulus variability may swamp internal noise and be the controlling source of uncertainty. But with artificial stimuli, an explicit model of internal noise may be required to account quantitatively for the variance of human performance. Determining the relative importance of natural stimulus variability and internal noise is an important topic for future work.

Influence of sampling error

The natural stimuli presented in the experiment were chosen via constrained random sampling (see Materials and methods). Random stimulus sampling increases the likelihood that the reported performance levels are representative of generic natural scenes. One potential concern is that the relatively small number of unique stimuli that can be practically used in an experiment (e.g., n = 3600 in this experiment) precludes a full exploration of the space of optimal estimates (see Figure 5B). Fortunately, the tilt estimates from the normative model change smoothly with image cue values. Systematic sparse sampling should thus be sufficient to explore the space. To rigorously determine the influence of each cue on performance, future parametric studies should focus on the role of particular image cue combinations and other important stimulus dimensions such as tilt variance.

Influence of non-optimal cues

Although the three local image cues used by the normative model are widely studied and commonly manipulated, there is no guarantee that they are the most informative cues in natural scenes. Automatic techniques could be used to find the most informative cues for the task (Geisler et al., 2009; Burge and Jaini, 2017; Jaini and Burge, 2017). These techniques have proven useful for other visual estimation tasks with natural stimuli (Burge and Geisler, 2011, 2012, 2014, 2015). However, in the current task, we speculate that different local cues are unlikely to yield substantially better performance (Burge et al., 2016). Also, given the similarities between human and model observer performance, any improved ability to predict human performance is likely to be modest at best. Nevertheless, the only way to be certain is to check.

3D surface orientation estimation

The estimation of the 3D structure of the environment is aided by the joint estimation of tilt and slant (Marr’s ‘2.5D sketch’) (Marr, 1982). Although we have shown that human and model tilt estimation performance are systematically affected by surface slant (Figure 7A), the current work only addresses the human ability to estimate unsigned tilt. We have not yet explicitly modeled how humans estimate signed tilt, how humans estimate slant, or how humans jointly estimate slant and tilt. We will attack these problems in the future.

Cue-combination with and without independence assumptions

The standard approach to modeling cue-combination, sometimes known as maximum likelihood estimation, includes a number of assumptions: a squared error cost function, cue independence, unbiased Gaussian-distributed single cue estimates, and a flat or uninformative prior (Ernst and Banks, 2002; but see Oruç et al., 2003). The approach used here (normative model; see Figure 5) assumes only a squared error cost function, and is guaranteed to produce the Bayes optimal estimate given the image cues, regardless of whether the other common assumptions hold. In natural scenes, it is often unclear whether the common assumptions hold. Methods with relatively few assumptions can therefore be powerful tools for establishing principled predictions. We have not yet fully investigated how the image cues are combined in tilt estimation, but we have conducted some preliminary analyses. For example, a simple average of the single-cue estimates (each based on luminance, texture, or disparity alone) underperforms the three-cue normative model. This result is not surprising given that the individual cues are not independent, that the single cue estimates do not follow a Gaussian distribution, and that the tilt prior is not flat. However, the current study is not specifically designed to examine the details of cue combination in tilt estimation. To examine cue-combination in this task rigorously, a parametric stimulus-sampling paradigm should be employed, a topic that will be explored in future work.

Local and global tilt estimation

A grand problem in perception and neuroscience research is to understand how local estimates are grouped into more accurate global estimates. We showed that local tilt estimates are unbiased predictors of groundtruth tilt and have nearly equal reliability (Figure 4). This result implies that optimal spatial pooling of the local estimates may be relatively simple. Assuming statistical independence (i.e., naïve Bayes), optimal spatial pooling is identical to a simple linear combination of the local estimates: the straight average of N local estimates $\hat{\tau}_{global} = \frac{1}{N}\sum_{i}^{N} \hat{\tau}_{i}^{local}$. Of course, local groundtruth tilts and estimates are spatially correlated, so the independence assumption will not be strictly correct. However, the spatial correlations could be estimated from the database and incorporated into the computations. Our work thus lays a strong empirically grounded foundation for the investigation of local-global processing in surface orientation estimation.
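
As a hedged illustration of why equal reliability makes pooling simple, consider the textbook result for combining independent, unbiased estimates by inverse-variance weighting. The weights $w_i$ and per-location variances $\sigma_i^2$ are symbols introduced here for illustration only, and the sketch ignores the circular nature of tilt:

```latex
\hat{\tau}_{\mathrm{global}} = \sum_{i=1}^{N} w_i \, \hat{\tau}_i^{\mathrm{local}},
\qquad
w_i = \frac{1/\sigma_i^2}{\sum_{k=1}^{N} 1/\sigma_k^2}
\quad\xrightarrow{\;\sigma_i = \sigma \text{ for all } i\;}\quad
w_i = \frac{1}{N}
```

When every local estimate has the same variance, the weights collapse to $1/N$ and the combination reduces to the straight average described above.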

Behavioral experiments with natural images

In classic studies of surface orientation perception, stimuli are usually limited in at least one of two important respects. If the stimuli are artificial (e.g., computer-graphics generated), groundtruth surface orientation is known but lighting conditions and textures are artificial, and it is uncertain whether results obtained with artificial stimuli will generalize to natural stimuli. If the stimuli are natural (e.g., photographs of real scenes), groundtruth surface orientation is typically unknown, which complicates the evaluation of the results. The experiments reported here used natural stereo-images with laser-based measurements of groundtruth surface orientation, and artificial stimuli with tilt, slant, distance, and contrast matched to the natural stimuli. This novel design allows us to relate our results to the classic literature, to determine the generality of results with both natural and artificial stimuli, and to isolate performance-controlling differences between the stimuli. In particular, we found that tilt variance is a pervasive performance-altering feature of natural scenes that is not explicitly considered in most investigations. The human visual system must nevertheless contend with tilt variance in natural viewing. We speculate that characterizing its impact is likely to be fundamental for understanding 3D surface orientation estimation in the real world, just as characterizing the impact of local luminance contrast has been important for understanding how humans detect spatial patterns in noise (Burgess et al., 1981).

Perception and the internalization of natural scene statistics

The current study is the latest in a series of reports that have attempted, with ever increasing rigor, to link properties of perception to the statistics of natural images and scenes. Our contribution extends previous work in several respects. First, previous work demonstrated similarity between human and model performance only at the level of summary statistics (Girshick et al., 2011; Burge et al., 2010b; Weiss et al., 2002; Stocker and Simoncelli, 2006). We demonstrate that a principled model, operating directly on image data, predicts the summary statistics, the distribution of estimates, and the trial-by-trial errors. Second, previous work showed that human observers behave as if their visual systems have encoded the task-relevant statistics of 2D natural images (Girshick et al., 2011). We show that human observers behave as if they have properly encoded the task-relevant joint statistics of 2D natural images and the 3D properties of natural scenes (also see (Burge et al., 2010b)). Third, previous work tested and modeled human performance with artificial stimuli only (Girshick et al., 2011; Burge et al., 2010b; Weiss et al., 2002; Stocker and Simoncelli, 2006). We test human performance with both natural and artificial stimuli. The dramatic, but lawful, changes in performance with natural stimuli highlight the importance of studies with the stimuli that visual systems evolved to process.

Materials and methods

Apparatus

The stereo images were presented with a ViewPixx Technologies ProPixx projector fitted with a 3D polarization filter. Left and right images were presented sequentially at a refresh rate of 120 Hz (60 Hz per eye), each at the same resolution (1920 × 1080 pixels). The observer was positioned 3.0 m from a 2.0 × 1.2 m Harkness Clarus 140 XC polarization-maintaining projection screen. This viewing distance minimizes the potential influence of screen cues to flatness (e.g., blur). Human observers wore glasses with passive (linear) polarized filters to isolate the image for the left and right eyes. The observer’s head was stabilized with a chin- and forehead-rest. From this viewing position, the projection screen subtended 36° × 21° of visual angle. The disparity-specified distance created by this projection system matched the distances measured in the original natural scenes. The projection display was linearized over 10 bits of gray level. The maximum luminance was 84 cd/m². The mean luminance was set to 40% of the projection system’s maximum luminance.

Participants

Three human observers participated in the experiment; two were authors, and one was naïve about the purpose of the experiment. Informed consent was obtained from participants before the experiment. The research protocol was approved by the Institutional Review Board of the University of Pennsylvania and is in accordance with the Declaration of Helsinki.

Experiment

Human observers binocularly viewed a small region of a natural scene through a circular aperture (1° or 3° diameter) positioned 5 arcmin of disparity in front of the scene point along the cyclopean line of sight. Observers communicated their tilt estimate with a mouse-controlled probe. Each observer viewed 3600 unique natural stimuli (150 stimuli per tilt bin x 24 tilt bins) presented with each of two apertures in the experiment (7200 total). Natural stimuli were constrained to be binocularly visible (no half-occlusions), to have slants larger than 30°, to have distances between 5 m and 50 m, and to have contrasts between 5% and 40%. Each observer also viewed 1440 unique artificial stimuli (60 stimuli per tilt bin x 24 tilt bins) with two apertures (2880 total). Artificial stimuli (1/f noise and phase- and orientation-randomized plaids) were matched to the natural stimuli on multiple additional dimensions (tilt, slant, distance, and contrast). Natural stimuli were presented in 48 blocks of 150 trials each, and artificial stimuli were presented in 12 blocks of 240 trials each, with interleaved blocks using small and large apertures.

Data analysis

Tilt is a circular (angular) variable. We computed the mean, variance, and error using standard circular statistics. The circular mean is defined as $\bar{\tau} = \arg[R]$ where $R = \left[\sum_{\tau} \exp[j\tau]\right]/N$ is the complex mean resultant vector. The circular variance is defined as $\mathrm{var}(\tau) = 1 - |R|$. Estimation error $e = \arg[\exp[j(\hat{\tau} - \tau)]]$ is the circular distance between the tilt estimate and groundtruth.
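
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code. The function names and the `period` argument are assumptions introduced here: `period=360` follows the formulas as written, while `period=180` is one conventional way to handle an unsigned (axial) variable, which the text does not specify.

```python
import cmath
import math


def mean_resultant(tilts_deg, period=360.0):
    """Complex mean resultant vector R = (1/N) * sum(exp(j * tau))."""
    scale = 2.0 * math.pi / period  # maps one period onto the full circle
    return sum(cmath.exp(1j * scale * t) for t in tilts_deg) / len(tilts_deg)


def circ_mean(tilts_deg, period=360.0):
    """Circular mean arg(R), returned in degrees within [0, period)."""
    scale = 2.0 * math.pi / period
    return (cmath.phase(mean_resultant(tilts_deg, period)) / scale) % period


def circ_var(tilts_deg, period=360.0):
    """Circular variance 1 - |R|: 0 for identical angles, 1 for evenly spread angles."""
    return 1.0 - abs(mean_resultant(tilts_deg, period))


def circ_error(estimate_deg, truth_deg, period=360.0):
    """Signed circular distance arg(exp(j * (estimate - truth))), in degrees."""
    scale = 2.0 * math.pi / period
    return cmath.phase(cmath.exp(1j * scale * (estimate_deg - truth_deg))) / scale
```

For example, `circ_error(5, 355)` is about 10° rather than −350°, which is why circular rather than linear differences are needed near the wrap-around point.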

Groundtruth tilt

Groundtruth tilt $\tau$ is computed from the distance data (range map $r$) co-registered to each natural image in the database. We defined groundtruth tilt $\tan^{-1}(\nabla_y r / \nabla_x r)$ as the orientation of the normalized range gradient (Marr, 1982). The range gradient was computed by convolving the distance data with a 2D Gaussian kernel having space constant $\sigma$ and then taking the partial derivatives in the x and y image directions (Burge et al., 2016). For the results presented in this manuscript, groundtruth tilt was computed using a space constant of $\sigma = 3$ arcmin; doubling this space constant does not change the qualitative results. The space constants correspond to kernel sizes of ~0.25°−0.50°.
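
The computation just described (Gaussian smoothing followed by partial derivatives) can be sketched as follows. This is an illustrative toy rather than the database pipeline. Everything here is an assumption made for the sketch: the function names, a space constant given in pixels rather than arcmin, central finite differences, edge clamping, `atan2` to resolve the quadrant, and the convention that the row index increases in the +y direction.

```python
import math


def gaussian_kernel(sigma_px):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma (assumed cutoff)."""
    radius = max(1, int(math.ceil(3 * sigma_px)))
    weights = [math.exp(-0.5 * (k / sigma_px) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(image, sigma_px):
    """Separable Gaussian blur of a 2D list, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma_px)
    radius = len(kernel) // 2
    rows, cols = len(image), len(image[0])

    def clamp(i, n):
        return min(max(i, 0), n - 1)

    # Blur along x (within each row), then along y (within each column).
    horiz = [[sum(kernel[k + radius] * image[y][clamp(x + k, cols)]
                  for k in range(-radius, radius + 1))
              for x in range(cols)] for y in range(rows)]
    return [[sum(kernel[k + radius] * horiz[clamp(y + k, rows)][x]
                 for k in range(-radius, radius + 1))
             for x in range(cols)] for y in range(rows)]


def tilt_deg(range_map, x, y, sigma_px=1.5):
    """Orientation of the smoothed range gradient at interior pixel (x, y), in [0, 360)."""
    smoothed = smooth(range_map, sigma_px)
    dr_dx = (smoothed[y][x + 1] - smoothed[y][x - 1]) / 2.0
    dr_dy = (smoothed[y + 1][x] - smoothed[y - 1][x]) / 2.0
    return math.degrees(math.atan2(dr_dy, dr_dx)) % 360.0


# Toy range map: a plane whose distance grows equally along x and y.
plane = [[10.0 + 0.1 * x + 0.1 * y for x in range(21)] for y in range(21)]
center_tilt = tilt_deg(plane, 10, 10)  # close to 45 degrees for this plane
```

Taking the result modulo 180° would give an unsigned tilt of the kind analyzed in this study.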

Image cues to tilt

Image cues to tilt (disparity, luminance, and texture cues) were computed directly from the images. Like groundtruth tilt, the disparity and luminance cues were defined as the orientation $\tan^{-1}(\nabla_y cue / \nabla_x cue)$ of the local disparity and luminance gradients. The local disparity gradient is computed from the disparity image, which is obtained from the left and right eye luminance images via standard local windowed cross-correlation (Burge et al., 2016; Tyler and Julesz, 1978; Banks et al., 2004). The window for cross-correlation had the same space constant as the derivative operator that was used to compute the gradient (see below). The texture cue to tilt is defined as the orientation of the major axis of the local amplitude spectrum of the luminance image. This texture cue is non-standard (but see Fleming et al., 2011). However, this texture cue is more accurate in natural scenes than traditional texture cues (Burge et al., 2016; Clerc and Mallat, 2002; Galasso and Lasenby, 2007; Malik and Rosenholtz, 1997; Massot and Hérault, 2008). For the main results presented in this manuscript, image cues were computed from the gradients using a space constant of $\sigma = 6$ arcmin; setting the space constant to $\sigma = 3$, 6, 9, or 12 arcmin does not change the qualitative results. The space constants correspond to kernel sizes of ~0.25°−1.0°.

Local luminance contrast

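Luminance contrast was defined as the root-mean-squared luminance values within a local area weighted by a cosine window. Specifically, luminance contrast is $C = \sqrt{\left[\sum_{\mathbf{x} \in A} \left((I(\mathbf{x}) - \bar{I})/\bar{I}\right)^2 W(\mathbf{x})\right] / \sum_{\mathbf{x} \in A} W(\mathbf{x})}$ where $\mathbf{x}$ is the spatial location, $W$ is a cosine window with area $A$, and $\bar{I} = \left[\sum_{\mathbf{x} \in A} I(\mathbf{x}) W(\mathbf{x})\right] / \sum_{\mathbf{x} \in A} W(\mathbf{x})$ is the local mean intensity.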
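
A windowed RMS contrast of this kind might be computed as in the sketch below. The exact form of the cosine window is not given in the text, so the radially symmetric raised-cosine window used here is an assumption, as are the function names and the use of a square patch stored as a list of rows.

```python
import math


def cosine_window(size):
    """Radially symmetric raised-cosine window on a size x size grid (assumed form)."""
    center = (size - 1) / 2.0
    radius = size / 2.0
    window = []
    for y in range(size):
        row = []
        for x in range(size):
            r = math.hypot(x - center, y - center)
            # Weight falls smoothly from 1 at the center to 0 at the window edge.
            row.append(0.5 * (1.0 + math.cos(math.pi * r / radius)) if r < radius else 0.0)
        window.append(row)
    return window


def rms_contrast(patch):
    """Window-weighted RMS contrast of a square luminance patch."""
    size = len(patch)
    window = cosine_window(size)
    w_sum = sum(sum(row) for row in window)
    # Window-weighted local mean intensity.
    mean = sum(patch[y][x] * window[y][x]
               for y in range(size) for x in range(size)) / w_sum
    # Window-weighted mean of squared, mean-normalized deviations.
    msq = sum(((patch[y][x] - mean) / mean) ** 2 * window[y][x]
              for y in range(size) for x in range(size)) / w_sum
    return math.sqrt(msq)
```

Because deviations are normalized by the local mean, the measure is unchanged when every luminance value in the patch is multiplied by the same positive factor.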

The output-mapping problem

On each trial, human observers communicated their perceptual estimate $\hat{\tau}$ by making a response $\hat{\tau}_{rsp}$ with a mouse-controlled probe. Unfortunately, the responses are not guaranteed to equal the perceptual estimates. An output-mapping function $\hat{\tau}_{rsp} = g(\hat{\tau})$ relates the response to the perceptual estimate, and an estimation function $\hat{\tau} = f(\tau)$ relates the estimate to the groundtruth tilt of each stimulus. When responses are biased, it is hard to conclude whether the biases are due to the output-mapping function or to the estimation function. When responses are unbiased, a stronger case can be made that the human responses equal the perceptual estimates. To obtain unbiased responses $\hat{\tau}_{rsp} = \tau$ from biased estimates $\hat{\tau} \neq \tau$, the output-mapping function would have to equal exactly the inverse of a biased estimation function: $g(\cdot) = f^{-1}(\cdot)$; this possibility seems unlikely and has no explanatory power. Thus, by Occam’s razor, unbiased responses imply unbiased output-mapping and estimation functions: $\hat{\tau}_{rsp} = \hat{\tau} = \tau$. Human responses to artificial stimuli were unbiased (Figure 2E), implying an unbiased output-mapping function. Assuming that the output-mapping function is stable across stimulus types, we conclude that the biased responses to natural stimuli accurately reflect biased perceptual estimates.

Monte Carlo simulations

To determine whether the model predictions are representative of randomly sampled natural stimuli, we simulated 1000 repeats of the experiment. On each repeat, we obtained a different sample of 3600 natural stimuli (150 in each tilt bin) from which we obtained 3600 optimal estimates. The samples are used to compute 95% confidence intervals on the model predictions, which are shown as the shaded regions in Figure 3A and Figure 4A.
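
The resampling logic described above can be sketched as follows. This is a toy illustration under stated assumptions, not the simulation used in the study: the per-bin pools of model errors are synthetic, the summary statistic (mean absolute error) is a placeholder, and the function names, the nearest-rank percentile rule, and the seeding are all choices made here for the example.

```python
import random


def monte_carlo_interval(pool_by_bin, n_per_bin, statistic, repeats=1000, seed=0):
    """95% interval of `statistic` over repeated stratified random samples.

    pool_by_bin: one list of candidate values per tilt bin (hypothetical pools).
    n_per_bin:   number of values drawn without replacement from each bin per repeat.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(repeats):
        sample = []
        for bin_pool in pool_by_bin:
            sample.extend(rng.sample(bin_pool, n_per_bin))
        values.append(statistic(sample))
    values.sort()
    # Nearest-rank 2.5th and 97.5th percentiles of the simulated statistic.
    lo = values[int(0.025 * (repeats - 1))]
    hi = values[int(round(0.975 * (repeats - 1)))]
    return lo, hi


def mean_abs(errors):
    """Placeholder summary statistic: mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


# Synthetic stand-in for model errors: 24 tilt bins with 500 candidates each.
toy_rng = random.Random(1)
pool = [[toy_rng.gauss(0.0, 20.0) for _ in range(500)] for _ in range(24)]
lo, hi = monte_carlo_interval(pool, n_per_bin=150, statistic=mean_abs, repeats=200)
```

With 150 draws from each of 24 bins, each repeat contains 3600 values, mirroring the sample size of one simulated experiment.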

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
    The texture gradient equation for recovering shape from texture
    1. M Clerc
    2. S Mallat
    (2002)
    IEEE Transactions on Pattern Analysis and Machine Intelligence 24:536–549.
    https://doi.org/10.1109/34.993560
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
    Vision: A Computational Investigation into the Human Representation and Processing of Visual Information
    1. D Marr
    (1982)
    New York: W H Freeman & Company.
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
    Eye Movements: A Window on Mind and Brain
    1. JB Pelz
    2. C Rothkopf
    (2007)
    Oculomotor behavior in natural and man-made environments, Eye Movements: A Window on Mind and Brain, 10.1016/B978-008044980-7/50033-1.
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53

Decision letter

  1. Jack L Gallant
    Reviewing Editor; University of California, Berkeley, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "The lawful imprecision of human surface tilt estimation in natural scenes" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by Reviewing Editor Jack Gallant, and David Van Essen as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Mark Lescroart (Reviewer #1); Michael Landy (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

The authors study the ability of human subjects to estimate surface tilt in natural images. They find (among other results) that humans are biased to estimate tilt at cardinal (vertical and horizontal) orientations and show that an image-computable Bayesian model makes estimates of tilt that are similar to human estimates. Both reviewers found the paper to be interesting and timely. Reviewer 2 made only minor comments, but reviewer 1 had some concerns and requested additional behavioral data. The reviewing editor is persuaded by these arguments.

Essential revisions:

1) The authors have found that variance (or noise) in tilt in the stimulus leads to less accurate estimation of tilt. However, the natural images used in this study are highly variable, so this result isn't all that surprising. The authors should analyze whether the model is predicting human performance well simply because some trials have more or less tilt variance than others. If this is the case, the result is much less interesting – variance (or noise) in a tilt should cause poorer tilt estimates. Similarly, alternative versions of the plots in Figure 3 should be generated with low-tilt-variance scenes, to see if the bias shows up as clearly.

2) The authors should validate their results using artificial images with more naturalistic textures (e.g., 1/f). In general, the authors should try to introduce variability into the artificial images of the same kind and magnitude as the variability in the natural images.

3) The authors should perform some control experiments to verify that the results hold for stimuli larger than 3 degrees. If possible, it would be good to verify these effects in complex natural scenes.

Reviewer #1:

The authors study the ability of human subjects to estimate surface tilt in natural images. They find (among other results) that humans are biased to estimate tilt at cardinal (vertical and horizontal) orientations and show that an image-computable Bayesian model makes estimates of tilt that are similar to human estimates.

This is an interesting and timely question, and the study is generally well executed. The figures are nice, the writing is clear, and the dataset the authors use (if it is to be shared) seems a useful contribution. However, I have some concerns about the experimental paradigm. None of these concerns alone are fatal flaws, but when combined with some of the results, they give me doubts about the impact of the rest of the results.

First, the artificial stimuli are perhaps too simplified (very regularly textured, wholly planar). This is not representative of studies of tilt estimation: several studies have had human subjects estimate surface orientation (or related quantities) of non-planar surfaces (e.g. Todd et al., 1996; Li and Zaidi, 2000; Norman et al., 2006). The highly simple nature of the artificial stimuli here creates several obvious differences between the natural and artificial images. Figure 8A shows that one such difference – tilt variance, present in the natural images but not in the artificial ones – accounts for all of the difference in mean tilt estimation accuracy between artificial and natural images. Stated another way, the authors have found that variance (or noise) in tilt in the stimulus leads to less accurate estimation of tilt. Note that this is not noise in the image cues or anything else – this is variance in the exact parameter that is being estimated. I may be missing something, but this particular result (which appears to be the biggest effect in the experiment) seems wholly expected to me. So, I am unimpressed by the conclusion statement: "The dramatic, but lawful, fall-off in performance with natural stimuli highlights the importance of performing studies with the stimuli visual systems evolved to process."

The large effect of tilt variance calls into question the size of other effects the authors report. Figure 8—figure supplement 1 shows that, for natural stimuli with low tilt variance, the bias toward estimating vertical (0 and 180 degree) tilts is greatly diminished (the count ratio between estimated and true instances of vertical tilt is very near to 1). (As a side note, Figure 8—figure supplement 1 is a critical figure and should appear in the main manuscript). It is also not clear how much tilt variance might be affecting the model's predictions of trial-to-trial errors; the authors should analyze whether the model is predicting human performance well simply because some trials have more or less tilt variance than others. If this is the case, the result is much less interesting – variance (or noise) in a tilt should cause poorer tilt estimates. Similarly, alternative versions of the plots in Figure 3 should be generated with low-tilt-variance scenes, to see if the bias shows up as clearly.

The other striking difference between the artificial and natural images is the extreme regularity of the textures in the artificial images (at least of the plaids shown in Figure 2). The authors also used 1/f noise as a texture in the artificial images – did human performance differ depending on whether the artificial stimuli were plaid or 1/f noise? In general, it seems that adding more types of variation to the artificial stimuli and assessing the effects of that variation would provide a good way to assess what sorts of variation make human performance look more like it does for natural images. I suspect that the authors plan to do this in future work, but I think it would substantially increase the impact of this work to include such data and analyses here.

Finally – I admit some disappointment with the choice of only showing 3 degree stimuli. To me, this lessens the impact of the work as well; a 3 degree image patch hardly constitutes a "scene". Thus, the following conclusion statement seems a bit of a reach: "We quantify performance in natural scenes and report that human tilt percepts are often neither accurate nor precise". Human estimates of tilt given full natural images (including much more context) would likely be better than the estimates reported here. I realize this is a very difficult problem, but eLife is also a broad, prestigious journal; studying tilt estimation in natural image patches may be a critical step on the way to studying tilt estimation in full scenes, but it also seems less broadly interesting.

Last, a few notes on the model: First, I am puzzled as to why the authors do not include model performance on their artificial stimuli, too. This seems to be a straightforward and easy test of the generality of the model. Second, it's not clear to me whether the "estimate cube" of optimal mappings between image cues and tilts is computed using some, all, or none of the same images that the subjects saw in the experiment. The authors should clarify this point.

I should note again that the concerns above are almost entirely about impact. I hesitate to reject an otherwise interesting and well-executed study on grounds that it's just not splashy enough. And there are several interesting and solid results in this paper. The fact that tilt variance is correlated with tilt angle in a large sample of natural scenes seems solidly supported and important. Modulo the questions I raised above, the MMSE model performance appears to provide a good match for human performance in the natural images. The persistent difference between errors estimating cardinal and oblique tilt, as well as the persistent bias to estimate horizontal tilt – both with matched tilt variance – are also interesting. Thus, I am on the fence about this paper, mostly because its impact seems marginal. I could be convinced to accept the paper with revisions or to reject the paper.

Reviewer #2:

This is a lovely paper, showing that a nonparametric Bayesian model of tilt estimation accounts startlingly well for human behavior in a tilt-estimation task. My comments are mainly about improving the clarity, not much more.

Introduction: Many of my comments are a result of reading it in (my) natural order, i.e., your page order with diversions to the Methods when needed. So, when I got here I wondered whether the patches to be judged were centered on the display or occluded in the position in the original images. That's never stated explicitly but implied by a figure that hasn't come up yet.

Introduction: You never motivate/justify pooling over tilt sign until much, much later, and so I was surprised you threw information away from the start. I wondered about it again for Figure 3 where, given that you provide disparity, the tilt sign ambiguity from pictorial cues should be alleviated.

Figure 2 et seq.: Why didn't you run the model on the artificial stimuli and show the model fits for those data points (or misfits, as the case may be)?

Subsection “Normative model”: The citation of Figure 6—figure supplement 2 here seems out of place. The analyses for this figure don't appear until the next page. Also, shouldn't all the supplementary figures be cited somewhere in the main text? I think a bunch aren't.

Subsection “Trial-by-trial Error”: Is -> Are.

Figure 8B: Exactly what bin cutoffs did you use for blue vs. red here?

Subsection “Effect of Natural Depth Variation”: artificially -> artificial.

Subsection “Generality of Conclusions and Future Directions”: our -> are.

Subsection “Cue-combination with and without independence assumptions”: This reference to Figure 6—figure supplement 2 is confusing, since the pooled/averaged model is not in the figure, but merely mentioned in its legend.

Subsection “Experiment”: More details please: What's your definition of contrast? Refer to the figure to state what part of the patch they were supposed to judge. Were the judged bins over 180 or 360 degrees (I only say this because Figure 1 leads the reader to believe that it's over 180 degrees only)? Please show and give details about the two types of artificial stimuli. Do responses to them differ from one another?

Subsection “Groundtruth tilt”: "atan2" is MATLAB notation, I'd think. You might want to say what you mean there.

Subsection “Image cues to tilt”: I'd like more detail here as well. The disparity cue must be based on a definition of "local" and a restriction of cross-correlation shifts. The disparity gradient requires a scale. The disparity gradient doesn't have a tilt sign ambiguity, but the texture cue does. I'm not sure "patch size at half height" won't confuse people (you don't mean the viewed patch, but rather the patch after multiplying by the Gaussian window).

Figure 6—figure supplement 2: Actually, I'm rather surprised that the single-cue models (especially those other than disparity) perform as well as they do. It worries me that there are weird regularities in your database. You never say what your definition of luminance is exactly, but why should tilt be dependent on luminance? How is each cue binned? Are the same bins used for the single-cue models, so that those models have vastly smaller measured parameters?

Figure 6—figure supplement 2 legend, line 12: "…but better than the prior or…".

https://doi.org/10.7554/eLife.31448.029

Author response

Essential revisions:

1) The authors have found that variance (or noise) in tilt in the stimulus leads to less accurate estimation of tilt. However, the natural images used in this study are highly variable, so this result isn't all that surprising. The authors should analyze whether the model is predicting human performance well simply because some trials have more or less tilt variance than others.

We agree that this is an important issue. The estimate means and variances as a function of groundtruth tilt are largely robust to changes in tilt variance (see new Figure 9). We also now show that the model does predict the distribution of human errors with near-planar natural stimuli (new addition to Figure 8—figure supplement 1). These results indicate that the model’s success at predicting performance is not due simply to the fact that natural stimuli had higher tilt variance on average.

If this is the case, the result is much less interesting – variance (or noise) in a tilt should cause poorer tilt estimates.

We agree that if our model predicted nothing other than an increase in overall estimate variance, the result would not be particularly interesting. However, the model predicts the pattern of estimate means and variances as a function of groundtruth tilt and the distributions of tilt errors. These are all non-trivial predictions.

Similarly, alternative versions of the plots in Figure 3 should be generated with low-tilt-variance scenes, to see if the bias shows up as clearly.

We have done so. We analyzed the subset of near-planar natural stimuli in the experimental dataset and present the results in Figure 8—figure supplement 1E. With low-tilt-variance (i.e., near-planar) locations in natural scenes, (i) the bias persists, (ii) human performance continues to be substantially different with natural and artificial stimuli, and (iii) human performance continues to be predicted by the model.

2) The authors should validate their results using artificial images with more naturalistic textures (e.g., 1/f). In general, the authors should try to introduce variability into the artificial images of the same kind and magnitude as the variability in the natural images.

We now present results (Figure 2—figure supplement 3) for the 1/f noise textures and two plaid textures separately (see below). The three textures yield similar results although the 1/f textures yield estimates with slightly higher variance. The main point to take from these plots is that the planar artificial stimuli produce very different patterns of results from the natural stimuli (e.g., the lack of bias and the different pattern of variances). Performance with these artificial stimuli is notably different than performance with the near-planar subset of natural stimuli.

3) The authors should perform some control experiments to verify that the results hold for stimuli larger than 3 degrees. If possible, it would be good to verify these effects in complex natural scenes.

As requested, we re-ran our experiment without apertures so that human observers had full-field views of each scene (36º × 21º). We added new Figure 2—figure supplement 5, which shows the estimate counts, means, variances, and conditional distributions for full-field viewing. The data with full-field viewing verify that the results in the original experiment hold for stimuli larger than 3 degrees.

Reviewer #1:

The authors study the ability of human subjects to estimate surface tilt in natural images. They find (among other results) that humans are biased to estimate tilt at cardinal (vertical and horizontal) orientations and show that an image-computable Bayesian model makes estimates of tilt that are similar to human estimates.

This is an interesting and timely question, and the study is generally well executed. The figures are nice, the writing is clear, and the dataset the authors use (if it is to be shared) seems a useful contribution. However, I have some concerns about the experimental paradigm. None of these concerns alone are fatal flaws, but when combined with some of the results, they give me doubts about the impact of the rest of the results.

First, the artificial stimuli are perhaps too simplified (very regularly textured, wholly planar). This is not representative of studies of tilt estimation: several studies have had human subjects estimate surface orientation (or related quantities) of non-planar surfaces (e.g. Todd et al., 1996; Li and Zaidi, 2000; Norman et al., 2006).

We agree that not all studies have been performed with planar surfaces, but most have. We have clarified the writing to make this point more clear and we now cite Norman et al. (2006), an unfortunate omission in our original submission.

The highly simple nature of the artificial stimuli here creates several obvious differences between the natural and artificial images. Figure 8A shows that one such difference – tilt variance, present in the natural images but not in the artificial ones – accounts for all of the difference in mean tilt estimation accuracy between artificial and natural images. Stated another way, the authors have found that variance (or noise) in tilt in the stimulus leads to less accurate estimation of tilt. Note that this is not noise in the image cues or anything else – this is variance in the exact parameter that is being estimated. I may be missing something, but this particular result (which appears to be the biggest effect in the experiment) seems wholly expected to me. So, I am unimpressed by the conclusion statement: "The dramatic, but lawful, fall-off in performance with natural stimuli highlights the importance of performing studies with the stimuli visual systems evolved to process." The large effect of tilt variance calls into question the size of other effects the authors report.

It is true that the absolute error with natural stimuli is smaller when there is no tilt variance; and we agree that this effect is not particularly surprising. But even when natural stimuli have low tilt variance (i.e., are near-planar), performance with natural and artificial stimuli is not the same. Thus, tilt variance alone cannot explain all differences in human performance with natural and artificial stimuli. See new Figure 9 and modified Figure 8—figure supplement 1.

That being said, tilt variance is a pervasive, performance-impacting stimulus factor in natural scenes that has not been systematically characterized before, and we think it important to do so. We speculate that characterizing the impact of tilt variance is likely to be fundamental to understanding 3D surface orientation estimation in the real world, just as characterizing the impact of local luminance contrast has been important for understanding how humans detect spatial patterns in noisy, uncertain backgrounds (Burgess et al., 1981).

We have reorganized the text to make more clear exactly what we are and what we are not claiming. We hope our efforts to improve clarity make for easier reading.

Figure 8—figure supplement 1 shows that, for natural stimuli with low tilt variance, the bias toward estimating vertical (0 and 180 degree) tilts is greatly diminished (the count ratio between estimated and true instances of vertical tilt is very near to 1). (As a side note, Figure 8—figure supplement 1 is a critical figure and should appear in the main manuscript). It is also not clear how much tilt variance might be affecting the model's predictions of trial-to-trial errors; the authors should analyze whether the model is predicting human performance well simply because some trials have more or less tilt variance than others. If this is the case, the result is much less interesting – variance (or noise) in a tilt should cause poorer tilt estimates. Similarly, alternative versions of the plots in Figure 3 should be generated with low-tilt-variance scenes, to see if the bias shows up as clearly.

The count ratio varies between ~0.5 and ~2.0 for low-tilt-variance natural stimuli (Figure 8—figure supplement 1B). Across all natural stimuli, the count ratio varied between ~0.5 and ~3.0. Also, the pattern of estimate bias persists with low-tilt-variance stimuli (Figure 8—figure supplement 1C). More importantly, we now show the distributions of estimation error for natural stimuli with low tilt variance (Figure 8—figure supplement 1E). There remain substantial differences in human performance with tilt-variance-matched natural and artificial stimuli.

All three human observers show significant trial-by-trial raw error correlations with the model for near-planar natural stimuli. Two of three observers show significant trial-by-trial bias-corrected error correlations with the model for near-planar natural stimuli.

We have substantially rewritten the Results section “Performance-impacting Stimulus Factors: Slant, Distance, & Natural Depth Variation” to make all these points clearer. We have included a new Figure 9, which shows that, although there is an effect of tilt variance, the basic performance patterns are robust to changes in tilt variance. We respectfully decline to move Figure 8—figure supplement 1 into the main text, as we think it breaks up the flow. We hope the changes we have made address the core of the reviewer’s concern.

The other striking difference between the artificial and natural images is the extreme regularity of the textures in the artificial images (at least of the plaids shown in Figure 2). The authors also used 1/f noise as a texture in the artificial images – did human performance differ depending on whether the artificial stimuli were plaid or 1/f noise? In general, it seems that adding more types of variation to the artificial stimuli and assessing the effects of that variation would provide a good way to assess what sorts of variation make human performance look more like it does for natural images. I suspect that the authors plan to do this in future work, but I think it would substantially increase the impact of this work to include such data and analyses here.

Human performance did not differ appreciably depending on the types of texture in artificial stimuli. Please see the plot above in Essential Revision #2. The variance of tilt estimates is slightly higher with 1/f stimuli, but the qualitative patterns are consistent across all three artificial stimulus textures.

We agree that studying the effects of this variation is an important question to address, and we plan to do so in the future. We believe a proper treatment of this question is a major undertaking in its own right and deserves its own manuscript.

Finally – I admit some disappointment with the choice of only showing 3 degree stimuli. To me, this lessens the impact of the work as well; a 3 degree image patch hardly constitutes a "scene". Thus, the following conclusion statement seems a bit of a reach: "We quantify performance in natural scenes and report that human tilt percepts are often neither accurate nor precise". Human estimates of tilt given full natural images (including much more context) would likely be better than the estimates reported here. I realize this is a very difficult problem, but eLife is also a broad, prestigious journal; studying tilt estimation in natural image patches may be a critical step on the way to studying tilt estimation in full scenes, but it also seems less broadly interesting.

Please see Essential Revision #3 above. We have collected a new dataset with full-scene stimuli, as the reviewer requested. We show the data from this new experiment in a new figure (Figure 2—figure supplement 5) and discuss the result in a new Discussion section titled “Influence of full-field viewing”. Results are essentially unchanged with full-field viewing. While this result may seem surprising at first, it makes sense. Scene structure is correlated only over a relatively local area. Except for the ground plane, it is fairly unusual for surfaces to have a constant orientation over very large visual angles. Thus, scene locations farther than 3º are likely to add little additional information.

Last, a few notes on the model: First, I am puzzled as to why the authors do not include model performance on their artificial stimuli, too. This seems to be a straightforward and easy test of the generality of the model.

With artificial stimuli, the human estimate means are nicely predicted by the model, but the human estimates have higher circular variance than the model predicts, although the pattern is similar (see figure below). We do not yet understand the reason for this discrepancy, but we suspect that an explicit model of internal noise will be required. This is a topic we would like to reserve for future work.
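For readers unfamiliar with the statistic mentioned here, circular variance can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `circular_variance_unsigned` is hypothetical, and the sketch assumes that unsigned tilts (period 180º) are doubled before averaging so that 0º and 180º coincide.

```python
import math

def circular_variance_unsigned(tilts_deg):
    """Circular variance of unsigned tilts (0 = identical, 1 = maximally spread).

    Assumption: tilts have a period of 180 deg, so each angle is doubled
    to map that period onto the full circle before averaging.
    """
    n = len(tilts_deg)
    # Mean resultant vector of the doubled angles
    c = sum(math.cos(math.radians(2.0 * t)) for t in tilts_deg) / n
    s = sum(math.sin(math.radians(2.0 * t)) for t in tilts_deg) / n
    # One minus the resultant length
    return 1.0 - math.hypot(c, s)
```

Under this convention, estimates of 1º and 179º count as nearly identical, which a linear variance would not capture.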

Author response image 1
Model performance on artificial stimuli
https://doi.org/10.7554/eLife.31448.028

Second, it's not clear to me whether the "estimate cube" of optimal mappings between image cues and tilts is computed using some, all, or none of the same images that the subjects saw in the experiment. The authors should clarify this point.

The estimate cube includes the same scene locations that were presented in the experiment. However, the estimate cube was constructed from approximately 1 billion samples. Only 3600 of these samples were used as experimental stimuli, a negligible fraction of this total. Excluding the 3600 unique experimental stimuli from the estimate cube has no measurable influence on the model predictions. We now make this point in the Figure 5 caption.

I should note again that the concerns above are almost entirely about impact. I hesitate to reject an otherwise interesting and well-executed study on grounds that it's just not splashy enough. And there are several interesting and solid results in this paper. The fact that tilt variance is correlated with tilt angle in a large sample of natural scenes seems solidly supported and important. Modulo the questions I raised above, the MMSE model performance appears to provide a good match for human performance in the natural images. The persistent difference between errors estimating cardinal and oblique tilt, as well as the persistent bias to estimate horizontal tilt – both with matched tilt variance – are also interesting. Thus, I am on the fence about this paper, mostly because its impact seems marginal. I could be convinced to accept the paper with revisions or to reject the paper.

Reviewer #2:

This is a lovely paper, showing that a nonparametric Bayesian model of tilt estimation accounts startlingly well for human behavior in a tilt-estimation task. My comments are mainly about improving the clarity, not much more.

Introduction: Many of my comments are a result of reading it in (my) natural order, i.e., your page order with diversions to the Methods when needed. So, when I got here I wondered whether the patches to be judged were centered on the display or occluded in the position in the original images. That's never stated explicitly but implied by a figure that hasn't come up yet.

Thanks. We now state that the patches were displayed in their original positions.

Introduction: You never motivate/justify pooling over tilt sign until much, much later, and so I was surprised you threw information away from the start. I wondered about it again for Figure 3 where, given that you provide disparity, the tilt sign ambiguity from pictorial cues should be alleviated.

As you point out, it is true that the disparity cue can alleviate tilt sign ambiguity. We are currently working on a new model that makes use of it. Also, even though the disparity cue is provided in the stimuli, there are some sign confusions in the human responses (e.g., see a weak pattern of data points in the lower-right quadrant of the Figure 2C scatter plot). These sign confusions complicate some of the data analyses. Furthermore, all three humans are remarkably consistent in their unsigned tilt estimation performance. That said, these are important issues that we will tackle in our next piece of work.

Figure 2 et seq.: Why didn't you run the model on the artificial stimuli and show the model fits for those data points (or misfits, as the case may be)?

Please see response above.

Subsection “Normative model”: The citation of Figure 6—figure supplement 2 here seems out of place. The analyses for this figure don't appear until the next page.

Thank you. We reorganized the text so that the material appears in a more natural order. The old Figure S6 is now Figure 6—figure supplement 2, and it is referred to only after introducing the analysis of trial-by-trial errors.

Also, shouldn't all the supplementary figures be cited somewhere in the main text? I think a bunch aren't.

All of the supplementary figures were cited in the main text.

Subsection “Trial-by-trial Error”: Is -> Are.

Thank you. Fixed.

Figure 8B: Exactly what bin cutoffs did you use for blue vs. red here?

Sorry. Tilt bins were 45º wide. For cardinal tilts, bins were centered on 0º and 90º. For oblique tilts, bins were centered on 45º and 135º. This information has now been added to the Figure 8 caption.
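The bin description above can be sketched as a small classifier. This is an illustrative sketch only: the function name `tilt_category` is hypothetical, and the treatment of bin edges (lower edge inclusive) is an assumption, since the response does not specify it.

```python
def tilt_category(tilt_deg):
    """Label an unsigned tilt as 'cardinal' or 'oblique'.

    Bins are 45 deg wide: cardinal bins are centered on 0 and 90 deg,
    oblique bins on 45 and 135 deg. Edge handling is an assumption.
    """
    t = tilt_deg % 180.0                    # wrap onto the unsigned range [0, 180)
    nearest = int((t + 22.5) // 45.0) % 4   # 0 -> 0 deg, 1 -> 45, 2 -> 90, 3 -> 135
    return "cardinal" if nearest % 2 == 0 else "oblique"
```

For example, a tilt of 170º falls in the bin centered on 0º (equivalently 180º) and is labeled cardinal.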

Subsection “Effect of Natural Depth Variation”: artificially -> artificial.

Fixed.

Subsection “Generality of Conclusions and Future Directions”: our -> are.

Fixed.

Subsection “Cue-combination with and without independence assumptions”: This reference to Figure 6—figure supplement 2 is confusing, since the pooled/averaged model is not in the figure, but merely mentioned in its legend.

Fixed. We removed the confusing reference.

Subsection “Experiment”: More details please: What's your definition of contrast.

Luminance contrast was defined as the root-mean-squared luminance values within a local area weighted by a cosine window. Specifically, luminance contrast is $C=\sqrt{\sum_{x\in A}\left(\frac{I(x)-\bar{I}}{\bar{I}}\right)^{2}W(x)\Big/\sum_{x\in A}W(x)}$, where $x$ is the spatial location, $W$ is a cosine window with area $A$, and $\bar{I}=\sum_{x\in A}I(x)W(x)\Big/\sum_{x\in A}W(x)$ is the local mean intensity.
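As an illustration of the windowed root-mean-squared contrast described here, a minimal sketch follows. It is not the authors' implementation: the function names are hypothetical, the patch is flattened to one dimension for brevity, and the raised-cosine window shape is an assumption (the response says only "cosine window").

```python
import math

def cosine_window(n):
    """Raised-cosine window of n samples (assumed shape; all weights positive)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * (i + 0.5) / n)) for i in range(n)]

def rms_contrast(intensity, window):
    """Windowed RMS contrast of a flattened patch (two equal-length lists)."""
    wsum = sum(window)
    # Window-weighted local mean intensity
    mean_i = sum(i * w for i, w in zip(intensity, window)) / wsum
    # Window-weighted mean of squared, mean-normalized deviations
    msq = sum(w * ((i - mean_i) / mean_i) ** 2
              for i, w in zip(intensity, window)) / wsum
    return math.sqrt(msq)
```

Because deviations are divided by the local mean, scaling every intensity by a constant leaves the contrast unchanged.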

Refer to the figure to state what part of the patch they were supposed to judge.

Observers estimated the tilt at the center of the 1º (or 3º) patch marked by the smaller of the two probe circles. This is indicated in the Figure 2 caption.

Were the judged bins over 180 or 360 degrees (I only say this because Figure 1 leads the reader to believe that it's over 180 degrees only).

Observers estimated tilt across all 360º. Groundtruth tilts were sampled from 24 bins, each 15º wide, with 150 samples per bin. We analyzed all the data but focused the majority of our analyses on the unsigned tilts (i.e., 360º modulo 180º).
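The sampling scheme and the signed-to-unsigned conversion can be sketched as below. The helper names are hypothetical, and the placement of bin edges (bins starting at 0º rather than centered on it) is an assumption.

```python
N_BINS = 24             # groundtruth tilt bins over 360 deg
BIN_WIDTH_DEG = 15.0    # width of each bin
SAMPLES_PER_BIN = 150
N_STIMULI = N_BINS * SAMPLES_PER_BIN  # total sampled scene locations

def unsigned_tilt(tilt_deg):
    """Collapse a signed tilt on [0, 360) onto the unsigned range [0, 180)."""
    return tilt_deg % 180.0

def signed_bin_index(tilt_deg):
    """Index of the 15-deg bin containing a signed tilt (edge placement assumed)."""
    return int((tilt_deg % 360.0) // BIN_WIDTH_DEG)
```

With these numbers, the design yields 3600 stimuli, and tilts of 90º and 270º map to the same unsigned tilt.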

Please show and give details about the two types of artificial stimuli. Do responses to them differ from one another?

We generated (1) a 1/f-textured plane, (2) a “sparse” 3.5 cpd plaid plane (i.e., a plane textured with the sum of two orthogonal sinusoidal gratings), and (3) a “dense” 5.25 cpd plaid plane. These details are included in the main text. We now show examples of all three artificial stimulus types in new Figure 2—figure supplement 3. In the same figure, we also show performance for each artificial stimulus type separately. Performance with all three stimulus types is similar. Note that all artificial stimuli were matched to the tilt, slant, distance, and luminance contrast of each natural scene patch.
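A plaid of the kind described (the sum of two orthogonal sinusoidal gratings) can be sketched as follows. This is a hedged illustration: the function name, the grating orientations (aligned with the x and y axes), the phase, and the `contrast` and `mean_lum` parameters are all assumptions, and the sketch covers only the fronto-parallel texture, not its projection onto a slanted plane.

```python
import math

def plaid_luminance(x_deg, y_deg, freq_cpd=3.5, contrast=0.5, mean_lum=0.5):
    """Luminance of a plaid at position (x_deg, y_deg) in degrees of visual angle.

    The plaid is the sum of two orthogonal cosine gratings with the same
    spatial frequency (cycles per degree). Orientation and phase are assumed.
    """
    phase_x = 2.0 * math.pi * freq_cpd * x_deg
    phase_y = 2.0 * math.pi * freq_cpd * y_deg
    # Average of the two gratings keeps the modulation within [-1, 1]
    modulation = 0.5 * (math.cos(phase_x) + math.cos(phase_y))
    return mean_lum * (1.0 + contrast * modulation)
```

Passing `freq_cpd=5.25` gives the denser of the two plaids under the same assumptions.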

Subsection “Groundtruth tilt”: "atan2" is MATLAB notation, I'd think. You might want to say what you mean there.

Thanks. The correction has been made.
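For context, `atan2` is the two-argument (four-quadrant) arctangent, which returns an angle over the full 360º range rather than the 180º range of the ordinary arctangent. The sketch below assumes, for illustration only, that tilt is taken as the direction of a gradient with horizontal and vertical components `dx` and `dy`; the function name is hypothetical, and the response does not state the exact definition the authors adopted.

```python
import math

def tilt_from_gradient(dx, dy):
    """Direction of the gradient (dx, dy), in degrees on [0, 360).

    Assumption: tilt is the gradient direction. math.atan2 uses the signs
    of both arguments, so opposite gradients give tilts 180 deg apart.
    """
    return math.degrees(math.atan2(dy, dx)) % 360.0
```

By contrast, `atan(dy / dx)` alone would assign the same angle to the gradients (1, 1) and (-1, -1), discarding the tilt sign.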

Subsection “Image cues to tilt”: I'd like more detail here as well. The disparity cue must be based on a definition of "local" and a restriction of cross-correlation shifts. The disparity gradient requires a scale.

Disparity was estimated using windowed cross correlation. The window for the windowed cross-correlation had the same space constant as the derivative operator used to compute the gradient. This information has been added to the methods section.
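The idea of windowed cross-correlation can be sketched in one dimension as below. This is a simplified illustration under stated assumptions, not the authors' procedure: the function name, the Gaussian window shape, the integer shift search, and the parameters `sigma` and `max_shift` are all hypothetical.

```python
import math

def windowed_xcorr_shift(left, right, center, sigma, max_shift):
    """Integer shift d maximizing the windowed correlation of left[i] with right[i + d].

    A Gaussian window (assumed shape) with space constant `sigma` is centered
    on index `center`; shifts are searched over [-max_shift, max_shift].
    """
    half = int(3 * sigma)
    idx = list(range(center - half, center + half + 1))
    if idx[0] - max_shift < 0 or idx[-1] + max_shift >= len(right) or idx[-1] >= len(left):
        raise ValueError("window plus search range falls outside the signals")
    w = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in idx]
    wsum = sum(w)
    a = [left[i] for i in idx]
    mean_a = sum(wi * ai for wi, ai in zip(w, a)) / wsum
    var_a = sum(wi * (ai - mean_a) ** 2 for wi, ai in zip(w, a))
    best_shift, best_score = 0, float("-inf")
    for d in range(-max_shift, max_shift + 1):
        b = [right[i + d] for i in idx]
        mean_b = sum(wi * bi for wi, bi in zip(w, b)) / wsum
        var_b = sum(wi * (bi - mean_b) ** 2 for wi, bi in zip(w, b))
        if var_a == 0.0 or var_b == 0.0:
            continue  # correlation is undefined for a constant patch
        cov = sum(wi * (ai - mean_a) * (bi - mean_b) for wi, ai, bi in zip(w, a, b))
        score = cov / math.sqrt(var_a * var_b)
        if score > best_score:
            best_shift, best_score = d, score
    return best_shift
```

The window's space constant sets what counts as "local", and `max_shift` restricts the range of shifts considered, which are the two quantities the reviewer asked about.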

The disparity gradient doesn't have a tilt sign ambiguity, but the texture cue does.

Correct. The disparity gradient does not have a tilt sign ambiguity, but in this paper we focused only on recovering the unsigned tilt. We are currently working on generalizing the model so that it can predict signed tilt. In future work, we will use the signed tilt information provided by the disparity gradient.

I'm not sure "patch size at half height" won't confuse people (you don't mean the viewed patch, but rather the patch after multiplying by the Gaussian window).

Fixed.

Figure 6—figure supplement 2: Actually, I'm rather surprised that the single-cue models (especially those other than disparity) perform as well as they do. It worries me that there are weird regularities in your database. You never say what your definition of luminance is exactly, but why should tilt be dependent on luminance?

Why does the orientation of the luminance gradient carry information about tilt? We do not have a solid grasp of the physics, but it has been previously reported (Potetz & Lee, 2003) that luminance and depth are weakly correlated. Beyond that, we do not really know.

The luminance signals that we used for the computations were proportional to candelas/m². Luminance contrast does not depend on the absolute luminance.

How is each cue binned?

Each cue is binned into 64 unsigned tilts, and the same bins are used for the single-cue models. Thus, the three-cue model has 64³ bins.

Are the same bins used for the single-cue models, so that those models have vastly smaller measured parameters?

Yes, the single-cue models had fewer bins (i.e., fewer parameters). Increasing the number of single-cue bins does not improve performance. In other words, cue quantization error is not responsible for the single-cue performance.
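To make the binning concrete, the sketch below builds a small lookup table from quantized cue triplets to a tilt estimate. It is an illustration only: the function names and the bin-edge placement are hypothetical, and taking the estimate in each bin as the circular mean of the groundtruth tilts (computed on doubled angles) is an assumption about how a conditional mean would be formed for unsigned tilts.

```python
import math
from collections import defaultdict

N_CUE_BINS = 64  # bins per cue, as stated in the response

def cue_bin(cue_deg, n_bins=N_CUE_BINS):
    """Quantize an unsigned cue value (degrees, period 180) into a bin index."""
    return int((cue_deg % 180.0) / 180.0 * n_bins) % n_bins

def build_estimate_table(samples):
    """Map each occupied bin triplet to the circular mean of its groundtruth tilts.

    samples: iterable of ((cue1, cue2, cue3), groundtruth_tilt), all in degrees.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    for cues, tilt in samples:
        key = tuple(cue_bin(c) for c in cues)
        ang = math.radians(2.0 * tilt)  # doubling handles the 180-deg period
        sums[key][0] += math.cos(ang)
        sums[key][1] += math.sin(ang)
    return {key: (math.degrees(math.atan2(s, c)) / 2.0) % 180.0
            for key, (c, s) in sums.items()}
```

A full three-cue table has 64³ = 262,144 cells, whereas a single-cue table has only 64, which is the difference in parameter count the reviewer asks about.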

Figure 6—figure supplement 2 legend, line 12: "… but better than the prior or…".

Fixed.

https://doi.org/10.7554/eLife.31448.030

Article and author information

Author details

  1. Seha Kim

    Department of Psychology, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    sehakim@upenn.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-0356-6168
  2. Johannes Burge

    Department of Psychology, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Resources, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    jburge@sas.upenn.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-0311-7875

Funding

National Institutes of Health (EY011747)

  • Johannes Burge

University of Pennsylvania (Startup Funds)

  • Johannes Burge

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Ethics

Human subjects: Informed consent was obtained from participants before the experiment. The research protocol was approved by the Institutional Review Board of the University of Pennsylvania (IRB approval protocol number: 824435) and is in accordance with the Declaration of Helsinki.

Reviewing Editor

  1. Jack L Gallant, University of California, Berkeley, United States

Publication history

  1. Received: August 25, 2017
  2. Accepted: January 29, 2018
  3. Accepted Manuscript published: January 31, 2018 (version 1)
  4. Version of Record published: March 9, 2018 (version 2)

Copyright

© 2018, Kim et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

