Active sensing in the categorization of visual patterns

Version of Record

Accepted for publication after peer review and revision.


Version of Record updated: February 6, 2017
Version of Record updated: December 13, 2016
Version of Record published: February 10, 2016
Accepted: December 6, 2015
Received: October 10, 2015



Abstract

Interpreting visual scenes typically requires us to accumulate information from multiple locations in a scene. Using a novel gaze-contingent paradigm in a visual categorization task, we show that participants' scan paths follow an active sensing strategy that incorporates information already acquired about the scene and knowledge of the statistical structure of patterns. Intriguingly, categorization performance was markedly improved when locations were revealed to participants by an optimal Bayesian active sensor algorithm. By using a combination of a Bayesian ideal observer and the active sensor algorithm, we estimate that a major portion of this apparent suboptimality of fixation locations arises from prior biases, perceptual noise and inaccuracies in eye movements, and the central process of selecting fixation locations is around 70% efficient in our task. Our results suggest that participants select eye movements with the goal of maximizing information about abstract categories that require the integration of information from multiple locations.

https://doi.org/10.7554/eLife.12215.001

eLife digest

To interact with the world around us, we need to decide how best to direct our eyes and other senses to extract relevant information. When viewing a scene, people fixate on a sequence of locations by making fast eye movements to shift their gaze between locations. Previous studies have shown that these fixations are not random, but are actively chosen so that they depend on both the scene and the task. For example, in order to determine the gender or emotion from a face, we fixate around the eyes or the nose, respectively.

Previous studies have only analyzed whether humans choose the optimal fixation locations in very simple situations, such as searching for a square among a set of circles. Therefore, it is not known how efficient we are at optimizing our rapid eye movements to extract high-level information from visual scenes, such as determining whether an image of fur belongs to a cheetah or a zebra.

Yang, Lengyel and Wolpert developed a mathematical model that determines the amount of information that can be extracted from an image by any set of fixation locations. The model could also work out the next best fixation location that would maximize the amount of information that could be collected. This model shows that humans are about 70% efficient in planning each eye movement. Furthermore, it suggests that the inefficiencies are largely caused by imperfect vision and inaccurate eye movements.

Yang, Lengyel and Wolpert’s findings indicate that we combine information from multiple locations to direct our eye movements so that we can maximize the information we collect from our surroundings. The next challenge is to extend this mathematical model and experimental approach to even more complex visual tasks, such as judging an individual’s intentions, or working out the relationships between people in real-life settings.

https://doi.org/10.7554/eLife.12215.002

Introduction

Several lines of evidence suggest that humans and other animals direct their sensors (e.g. their eyes, whiskers, or hands) so as to extract task-relevant information efficiently (Yarbus, 1967; Kleinfeld et al., 2006; Lederman and Klatzky, 1987). Indeed, in vision, the pattern of eye movements used to scan a scene depends on the type of information sought (Hayhoe and Ballard, 2005; Rothkopf et al., 2007), and has been implied to follow from an active strategy (Najemnik and Geisler, 2005; Renninger et al., 2007; Navalpakkam et al., 2010; Nelson and Cottrell, 2007; Toscani et al., 2013; Chukoskie et al., 2013) in which each saccade depends on the information gathered about the current visual scene and prior knowledge about scene statistics. However, until now, studies of such active sensing have either been limited to search tasks or to qualitative descriptions of the active sensing process. In particular, no studies have shown whether the information acquired by each individual fixation is being optimized. Rather, the fixation patterns have either been described without a tight link to optimality (Ballard et al., 1995; Epelboim and Suppes, 2001) or compared to an optimal strategy only through summary statistics such as the total number of eye movements and the distribution of saccade vectors (Najemnik and Geisler, 2005; 2008) that could have arisen through a heuristic. Therefore, these studies leave open the question as to what extent eye movements truly follow an active optimal strategy. In order to study eye movements in a more principled quantitative manner, we estimated the efficiency of eye movements in a high-level task on a fixation-by-fixation basis.

Here, we focus on a pattern categorization task that is fundamentally different from visual search, in which often there is a single location in the scene that has all the necessary information (the target), and eye movements are well described by the simple mechanism of inhibition of return (Klein, 2000). In contrast, in many other tasks, such as constructing the meaning of a sentence of written text, or judging from a picture how long a visitor has been away from a family (Yarbus, 1967), no single visual location contains the necessary information, and thus such tasks require more complex eye movement patterns. While some basic-level categorization tasks can be solved in a single fixation (Thorpe et al., 1996; Li et al., 2002), many situations require multiple fixations to extract several details at different locations to make a decision. Therefore, when people have to extract abstract information (e.g., how long the visitor has been away) they need to integrate a series of detailed observations (such as the facial expressions, postures and gestures of the people) across the scene, relying heavily on foveal vision, with peripheral vision playing a more minor role (Levi, 2008).

We illustrate the key features of active sensing in visual categorization by a situation that requires the categorization of an animal based on its fur that is partially obscured by foliage (Figure 1). As each individual patch of fur can be consistent with different animals, such as a zebra or a cheetah, and the foliage prevents the usage of gist information (Oliva and Torralba, 2006), multiple locations have to be fixated individually, and the information accumulated across these locations, until a decision can be made with high confidence. For maximal efficiency, this requires a closed loop interaction between perception, which integrates information from the locations already fixated with prior knowledge about the prevalence of different animals and their fur patterns, and thus maintains beliefs about which animal might be present in the image, and the planning of eye movement, which should direct the next fixation at a location which has potentially the most information relevant for the categorization (Figure 1). Inspired by this example, and to allow a mathematically tractable quantification of the information at any fixation location, we designed an experiment with visual patterns that were statistically well-controlled and relatively simple, while ensuring that foveal vision would dominate by using a gaze-contingent categorization task in which we tracked the eye and successively revealed small apertures of the image at each fixation location. In contrast to previous studies (Najemnik and Geisler, 2005; Renninger et al., 2007; Peterson and Eckstein, 2012; Morvan and Maloney, 2012), our task required multiple locations to be visited to extract information about abstract pattern categories.

Figure 1

Active sensing involves an interplay between perception and action.

When trying to categorize whether a fur hidden behind foliage (left) belongs to a zebra or a cheetah, evidence from multiple fixations (blue, the visible patches of the fur, and their location in the image) needs to be integrated to generate beliefs about fur category (right, here represented probabilistically, as the posterior probability of the particular animal given the evidence). Given current beliefs, different potential locations in the scene will be expected to have different amounts of informativeness with regard to further distinguishing between the categories, and optimal sensing involves choosing the maximally informative location (red). In the example shown, after the first two fixations (blue) it is ambiguous whether the fur belongs to a zebra or a cheetah, but active sensing chooses a collinearly located revealing position (red) which should be informative and indeed reveals a zebra with high certainty. Note that this is just an illustrative example.

https://doi.org/10.7554/eLife.12215.003

Results

Categorization performance and eye movement patterns

We generated images of three types: patchy, horizontal stripy, and vertical stripy (Figure 2A). Participants had to categorize each image pattern as patchy or stripy, disregarding whether a stripy image was horizontal or vertical (the inclusion of two different stripy image types prevented participants from solving the task based on one image axis alone). The images were generated by a flexible statistical model that could generate many examples from each of the three image types, so that the individual pixel values varied widely even within a type and only higher-order statistical information (i.e. the length scale of spatial correlations) could be used for categorization. We first presented the participants with examples of full images to familiarize them with the statistics of the image types and to ensure their categorization with full images was perfect. We then switched to an active gaze-contingent mode in which the entire pattern was initially occluded by a black mask and the underlying image was revealed with a small aperture at each fixation location (Figure 2B; for visibility, the black mask is shown as white). As a control, we also used a number of passive revealing conditions in which the revealing locations were chosen by the computer rather than in a gaze-contingent manner. In all conditions, we controlled the number of revealings on each trial before requiring the participants to categorize the image (Figure 2B). To ensure that participants had an equal chance to extract information from all revealing locations in the passive as well as the active conditions, we allowed the participants to rescan the revealed locations after the final revealing (see also Materials and methods for full rationale). Importantly, in the active revealing condition, even though rescanning was allowed after the final revealing, participants had to select all revealing locations in real time without knowing how many revealings they would be allowed on a given trial. Therefore, although rescanning could improve categorization, it was unlikely to influence participants’ active revealing strategy. To confirm this, we also performed a control in which rescanning was not allowed (see below).

Figure 2

Image categorization task and participants’ performance.

(A) Example stimuli for each of the three image types sampled from two-dimensional Gaussian processes. (B) Experimental design. Participants started each trial by fixating the center cross. In the free-scan condition, an aperture of the underlying image was revealed at each fixation location. In the passive condition, revealing locations were chosen by the computer. In both conditions, after a random number of revealings, participants were required to make a category choice (patchy, P, versus stripy, S) and were given feedback. (C) Categorization performance as a function of revealing number for each of the three participants (symbols and error bars: mean $\pm$ SEM across trials), and their average, under the free-scan and passive conditions corresponding to different revealing strategies. Lines and shaded areas show across-trial mean $\pm$ SEM for the ideal observer model. Figure 2—figure supplement 1 shows categorization performance in a control experiment in which no rescanning was allowed.

https://doi.org/10.7554/eLife.12215.004

In the free-scan (active) condition, categorization performance improved with the number of revealings for each participant (Figure 2C, red points), indicating the successful integration of information across a large number of fixations. Although we used a gaze-contingent display, the task still allowed participants to employ natural everyday eye movement strategies, consistent with the inter-saccadic intervals (mean 408 ms) and relative saccade sizes (0.75–3.9 normalized by the three different relevant length scales of the stimuli), which were similar to, though somewhat shorter than, those recorded for everyday activities (e.g. tea making; mean inter-saccadic interval 497 ms, and 0.95–19 relative saccade size normalized by the size of different fixated objects; Hayhoe et al., 2003; Land and Tatler, 2009).

To examine whether fixation locations depended on the underlying image patterns, we constructed revealing density maps for each image type. To account for the translation-invariant statistics of the underlying images of a type, we used the relative location of revealings obtained by subtracting from the absolute location of each revealing (measured in screen-centered coordinates) the center of mass of the absolute locations within each trial (Materials and methods). In order to compare eye movement patterns across conditions and participants, we subtracted the mean density map across all images for each participant. Importantly, we found that the pattern of revealings strongly depended on image type (Figure 3A, first four rows). The patterns of eye movements for images of the same type were positively correlated for each participant (Figure 3B, left, orange bars; p<0.001 in all cases), whereas eye movements for images of different types were negatively correlated (Figure 3B, left, purple bars; p<0.001 in all cases). We also found that eye movement patterns became increasingly differentiated over the course of the trial as progressively more of the image was revealed (Figure 3B, left, curves). The dependence of the eye movement patterns on the underlying image type shows that participants employed an active sensing strategy.
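The map construction described above can be illustrated with a minimal sketch. It assumes a plain two-dimensional histogram as the density estimate; the actual estimator is described in Materials and methods, and the function names, bin count and spatial extent below are hypothetical choices made only for illustration.

```python
import numpy as np

def relative_locations(xy):
    """Subtract the within-trial center of mass from absolute revealing locations."""
    xy = np.asarray(xy, dtype=float)
    return xy - xy.mean(axis=0, keepdims=True)

def density_map(rel_xy, extent=10.0, bins=21):
    """Histogram-based density of relative locations (assumed estimator), summing to 1."""
    edges = np.linspace(-extent, extent, bins + 1)
    hist, _, _ = np.histogram2d(rel_xy[:, 0], rel_xy[:, 1], bins=[edges, edges])
    total = hist.sum()
    return hist / total if total > 0 else hist

def mean_corrected(maps_by_type):
    """Remove the mean density across image types from each type's map."""
    stacked = np.stack(list(maps_by_type.values()))
    mean_map = stacked.mean(axis=0)
    return {name: m - mean_map for name, m in maps_by_type.items()}

def map_correlation(map_a, map_b):
    """Pearson correlation between two density maps, taken over all bins."""
    return float(np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1])
```

Within-type and across-type correlations would then be obtained by applying `map_correlation` to mean-corrected maps computed from independent sets of trials of the same or of different image types.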

Figure 3

Density maps of relative revealing locations and their correlations.

(A) Revealing density maps for participants and BAS. Last three columns show mean-corrected revealing densities for each of the three underlying image types (removing the mean density across image types, first column). Bottom: color scales used for all mean densities (left), and for all mean-corrected densities (right). All density maps use the same scale, such that a density of 1 corresponds to the peak mean density across all maps. Figure 3—figure supplement 1 shows revealing density maps obtained for participants in a control experiment in which no rescanning was allowed. Figure 3—figure supplement 2 shows the measured saccadic noise that was incorporated into the BAS simulations. Figure 3—figure supplement 3 shows density maps separately for correct and incorrect trials. (B) The curves are correlations for individual participants as a function of revealing number with their own maps (left) and the maps generated by BAS (right). The bars are correlations at 25 revealings (see Materials and methods). Orange shows within image type correlation, i.e. correlation between revealing densities obtained for images of the same type, and purple shows across image type correlation. Data are represented as mean $\pm$ SD for the curves and mean $\pm$ 95% confidence intervals for the bars.

https://doi.org/10.7554/eLife.12215.006

To assess whether this active strategy used by our participants contributed to performance improvement, we examined the same participants in a passive revealing condition. When the revealings were drawn randomly from an isotropic Gaussian centered on the image, performance was substantially impaired (Figure 2C, blue points), indicating that participants’ decision performance benefited from their active sensing strategy. After the final revealing, participants rescanned only briefly before making a decision, on average for 5.0 s in the active condition and 6.4 s in the passive random condition. Moreover, their performance did not improve with increased rescanning time; instead, it correlated negatively with rescanning time for 2 out of 3 participants, such that the probability of a correct decision for rescanning times at the 25th and 75th percentiles of the rescanning time distribution fell from 0.77 to 0.54 (p<0.001) and from 0.77 to 0.66 (p<0.03), respectively. This suggests that longer rescanning times indicate when participants are uncertain rather than providing the main source of information for their decisions. This is in contrast to additional revealings during the original scanning period, which clearly benefit performance when chosen appropriately (Figure 2C).

To further examine whether the rescanning period after the final revealing affected the participants’ scanning strategy, we examined additional participants in a free-scan condition in which no rescanning was allowed after the final revealing (the display blanked). The revealing density maps for these participants were very similar to the maps of participants who were allowed to rescan (average within-type vs. across-type correlation across the two groups of participants: 0.63 vs. −0.30) and performance was also similar (Figure 2—figure supplement 1 and Figure 3—figure supplement 1), although, as expected, slightly worse without rescanning. The proportion correct across all trials for the participants who were allowed to rescan was 0.65, 0.66 and 0.69 (average 0.66) and for those not allowed to rescan was 0.64, 0.58 and 0.66 (average 0.63). This confirms that allowing rescanning did not substantially change participants’ revealing strategy.

Bayesian ideal observer

We constructed an ideal observer (Geisler, 2011) which computed a posterior distribution, $ℙ(c | D)$, over image category $c$ given the observations $D$ (collection of previous revealing locations and revealed pixel values in the trial) and made a choice such that the category with higher posterior probability was more likely to be selected (see Materials and methods). To construct this model, we considered three sources of suboptimality: prior biases implying imperfect knowledge of the precise correlation length scales of each pattern category, perception noise that distorts the displayed pixel values, and decision noise which occasionally results in selecting the category with the lower posterior probability (Houlsby et al., 2013). We fitted six models to the individual choices of each participant. These models differed in the kind of prior bias and decision noise they included, and we selected the model with the strongest statistical evidence, as quantified by the Bayesian information criterion (Tables 1–2). The best model (4 parameters in total) provided a close match to the participants’ performance both in the active and passive random-revealing conditions (Figure 2C, red and blue lines). Crucially, this model also allowed us to estimate the beliefs that participants held about image categories at any point in a trial, which was necessary for determining the optimal next eye movement that could maximally disambiguate between the categories.
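The two steps of such an observer, computing a posterior over categories and then making a noisy choice, can be sketched as follows. This is an illustration only: the per-category log-likelihoods are taken as given, and the particular way in which the stimulus-dependent ($β$) and stimulus-independent ($κ$) decision noise enter the choice rule is an assumption made here for concreteness, not the form specified in Materials and methods. All function names are hypothetical.

```python
import numpy as np

def posterior_from_loglik(log_lik, prior=None):
    """Bayes' rule over categories, computed stably from log-likelihoods of D."""
    log_lik = np.asarray(log_lik, dtype=float)
    if prior is None:
        # Assumption: flat prior over categories.
        prior = np.full(log_lik.shape, 1.0 / log_lik.size)
    log_post = log_lik + np.log(prior)
    log_post -= log_post.max()  # guard against overflow before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

def choice_probability(post, beta=1.5, kappa=0.1):
    """Assumed noisy choice rule: a power-law softening of the posterior (beta)
    mixed with uniformly random responding on a fraction kappa of trials."""
    post = np.asarray(post, dtype=float)
    powered = post ** beta
    soft = powered / powered.sum()
    return kappa / post.size + (1.0 - kappa) * soft
```

Under this assumed rule, the category with the higher posterior probability is chosen more often but not always, which is the qualitative behaviour described in the text.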

Table 1

Maximum likelihood parameters of the model (see Materials and methods for details) with the best BIC score (see Table 2).

https://doi.org/10.7554/eLife.12215.010

Participant	Perception noise, σ_p	Prior bias, Δ	Decision noise: stimulus-dependent, β	Decision noise: stimulus-independent, κ
1	0.5	0.58°	1.4	0.044
2	0.5	0.61°	1.9	0.12
3	0.3	0.54°	1.5	0.10

Table 2

Model comparison results using Bayesian information criterion (BIC, lower is better). Each row is a different model using a different combination of included (+) and excluded (–) parameters (columns, see Materials and methods for details). Last column shows BIC score relative to the BIC of the best model (number 4).

https://doi.org/10.7554/eLife.12215.011

Model	Perception noise, σ_p	Prior bias: scale, α	Prior bias: offset, Δ	Decision noise: stimulus-dependent, β	Decision noise: stimulus-independent, κ	BIC (relative to model 4)
1	+	–	–	+	–	160
2	+	–	–	+	+	139
3	+	–	+	+	–	58
4	+	–	+	+	+	0
5	+	+	–	+	–	105
6	+	+	–	+	+	102

Predicting eye movement patterns by a Bayesian active sensor algorithm

To be able to rigorously assess how close our participants were to optimal sensing, we developed a Bayesian active sensor (BAS) algorithm which is optimal in minimizing categorization error with every single revealing. That is, for our task, the aim of BAS is to choose the next fixation location, $x^{*}$, so as to maximally reduce uncertainty in the category (MacKay, 1992). This objective is formalized by the BAS score function, which expresses the expected information gain when choosing $x^{*}$, and which can be conveniently computed as:

$$\mathrm{Score}(x^{*} | D) = H[z^{*} | x^{*}, D] - ⟨ H[z^{*} | x^{*}, c, D] ⟩_{ℙ(c | D)} \qquad (1)$$

where $H$ denotes entropy (a measure of uncertainty), $z^{*}$ is the possible pixel value at $x^{*}$, $D$ is the collection of revealing locations and revealed pixel values that have been observed in the trial as above, $c$ is image category, and $⟨ \cdot ⟩$ denotes averaging over the two categories weighted by their posterior probabilities, as computed by the ideal observer (for more details, see Materials and methods). This expresses a trade-off between two terms. The first term encourages the selection of locations, $x^{*}$, where we have the most overall uncertainty about the pixel value, $z^{*}$, while the second term prefers locations for which our expected pixel value for each category is highly certain.
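As a minimal sketch of how the score in Equation 1 could be evaluated at a single candidate location, assume (as an illustrative simplification, not the implementation described in Materials and methods) that each category predicts the pixel value $z^{*}$ with a single Gaussian of known mean and variance. The second term is then a posterior-weighted average of Gaussian entropies, and the first term is the entropy of the resulting mixture, which is integrated numerically here. The function names and the grid-based integration are hypothetical.

```python
import numpy as np

def gaussian_entropy(var):
    """Differential entropy (in nats) of a Gaussian with variance var."""
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

def mixture_entropy(weights, means, variances, n_grid=4001):
    """Entropy of a Gaussian mixture, by numerical integration on a grid."""
    weights, means, variances = (
        np.asarray(a, dtype=float) for a in (weights, means, variances)
    )
    sd = np.sqrt(variances)
    z = np.linspace(np.min(means - 8.0 * sd), np.max(means + 8.0 * sd), n_grid)
    dens = np.exp(-0.5 * (z[:, None] - means) ** 2 / variances)
    dens /= np.sqrt(2.0 * np.pi * variances)
    p = dens @ weights  # mixture density p(z* | x*, D) on the grid
    safe_p = np.where(p > 0, p, 1.0)  # avoid log(0); those points contribute 0
    return float(np.sum(-p * np.log(safe_p)) * (z[1] - z[0]))

def bas_score(posterior, means, variances):
    """Expected information gain about the category from observing z* at x*."""
    h_total = mixture_entropy(posterior, means, variances)  # first term
    h_per_category = gaussian_entropy(np.asarray(variances, dtype=float))
    h_given_c = float(np.dot(posterior, h_per_category))    # second term
    return h_total - h_given_c
```

Under these assumptions the score is near zero when the two categories make the same prediction at $x^{*}$, and it approaches the entropy of the current posterior over categories when their predictions are well separated, which is why the score depends on the observer's current beliefs.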

Figure 4A shows a sequence of fixations on a representative trial. On each fixation, the BAS score is computed for all possible positions (grayscale map) based on all previous fixation locations (green dots) and the pixel values revealed there. While the BAS algorithm would choose the position with the highest BAS score as the next fixation location (blue crosses), the participant might choose a different, suboptimal, fixation location (yellow circles). Nevertheless, the informativeness of most of the participant’s fixation locations was very high, as expressed by their information-percentile values (the percentage of putative fixation locations with lower BAS scores than the one chosen by the participant).
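The information-percentile value defined above reduces to a simple count over the score map. The sketch below assumes the BAS score has already been evaluated on a discrete grid of putative fixation locations; the function name and the grid representation are hypothetical.

```python
import numpy as np

def information_percentile(score_map, chosen_index):
    """Percentage of putative locations whose score is lower than the chosen one."""
    score_map = np.asarray(score_map, dtype=float)
    chosen_score = score_map[chosen_index]
    return 100.0 * float(np.mean(score_map < chosen_score))

# Toy 2 x 2 score map: the location at row 1, column 0 beats two of four locations.
toy_scores = [[0.1, 0.2], [0.3, 0.4]]
toy_percentile = information_percentile(toy_scores, (1, 0))
```

Because the measure depends only on the rank of the chosen location, a fixation far from the score maximum can still receive a high percentile when the map is multimodal.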

Figure 4

Example trial of the Bayesian active sensor (BAS) and its maximum entropy variant.

(A) The operation of BAS in a representative trial for saccades 1–8 and 14 (underlying image shown top left). For each fixation (left, panels), BAS computes a score across the image (gray scale, Equation 1). This indicates the expected informativeness of each putative fixation location based on its current belief about the image type, expressed as a posterior distribution (inset, lower left), which in turn is updated at each fixation by incorporating the new observation of the pixel value at that fixated location. Crosses show the fixation locations with maximal score for each saccade, green dots show past fixation locations chosen by the participant and the yellow circle shows the current fixation location. Percentage values (bottom right) show their information percentile values (the percentage of putative fixation locations with lower BAS scores than the one chosen by the participant). The histogram on the right shows the distribution of percentile values across all participants, trials and fixations. (B) Predictions of the maximum entropy variant (the first term in Equation 1) as in (A). For saccades 1–3, the fixation locations with maximal score (crosses) are not shown because the maxima comprise a continuous region near the edge of the image instead of discrete points. Note that entropy can be maximal further from (e.g. fixation 4) or nearer to the edges of the image (e.g. fixation 1), depending on the trade-off between the two additive components defining it: the BAS score, which tends to be higher near revealing locations (panel A), and uncertainty due to the stochasticity of the stimulus and perception noise, which tends to be greater away from revealing locations. Figure 4—figure supplement 1 shows two illustrative examples for this trade-off.

https://doi.org/10.7554/eLife.12215.012

We simulated eye movement patterns derived by BAS for the same images shown to our participants. In order to take into account basic biological constraints on the accuracy of eye movements, we included saccadic variability and bias in the model, based on measurements made independently in a group of participants. These measurements captured, as a function of desired saccade amplitude, the standard deviation and bias of saccades along the desired saccade direction and the standard deviation orthogonal to it (Figure 3—figure supplement 2). The predicted (mean-corrected) pattern of eye movements closely matched those observed (Figure 3A, last two rows): they were positively correlated with participants’ eye movements for the same image type (Figure 3B, right, orange bars; p<0.001 in all cases), but negatively correlated with those for different image types (Figure 3B, right, purple bars; p<0.001 in all cases). These differences increased as a function of revealing number (Figure 3B, right, curves). Moreover, when we split participants’ trials into those in which they made a correct or incorrect decision, the pattern of eye movements derived from the correct trials correlated better with the BAS pattern than that derived from incorrect trials (Figure 3—figure supplement 3, $ρ_{correct} = 0.74$, $ρ_{incorrect} = 0.20$, p<0.001 for the average participant), further suggesting that following a BAS-like strategy was beneficial for performance.

For comparison, we also analyzed how well participants’ fixations could be accounted for by a strategy using a variant of the score that only included the first term in Equation 1 and thus selected locations with maximal entropy rather than maximal information gain. We found that this strategy provided a substantially poorer fit to our eye movement data than the full BAS algorithm, as measured by the distribution of the scores corresponding to actual fixation locations (Figure 4B) and the anti-correlations between predicted and actual revealing maps at 25 revealings ($ρ = -0.62$, $-0.51$, $-0.43$; all p<0.001).

Fixation informativeness

In order to obtain a fixation-by-fixation measure of the informativeness of participants’ individual eye movements, we used the ideal observer model to quantify the amount of information accumulated about image category over subsequent revealings in a trial for different revealing strategies (Figure 5A). This information-based measure is more robust than directly measuring distances between optimal and actual scan paths because multiple locations are often (nearly) equally informative when planning the next saccade. For example, in the trial shown in Figure 4A, the BAS score map was clearly multimodal for several fixations, and some of the participant’s fixation locations were indeed distant from the corresponding locations that BAS would have chosen, yet their informativeness was generally very high. Therefore, categorization performance is better understood in terms of information measures on the revealed locations rather than by the geometry of individual scan paths, which has traditionally been used in previous studies (Najemnik and Geisler, 2005; Renninger et al., 2007).

Figure 5

Information gain as a function of revealing number for different strategies.

(A) Cumulative information gain of an ideal observer (matched to participants’ prior bias and perceptual noise) with different revealing strategies (black, green, and blue) and participants’ own revealings (red). Data are represented as mean $\pm$ SEM across trials. Figure 5—figure supplement 1 shows a measure of efficiency extracted from these information curves across sessions. (B) Information gains for three heuristic strategies (see text and Materials and methods for details): posterior-independent & order-dependent fixations (orange), posterior-dependent & order-independent fixations (purple), and posterior- & order-dependent fixations (brown). The information gain curves for the three heuristics overlap in all cases. Participants’ active revealings (red lines, as in A) were 1.81 (95% CI, 1.68–1.94), 1.85 (95% CI, 1.72–1.99), and 1.92 (95% CI, 1.74–2.04) times more efficient in gathering information than these heuristics, respectively. Data are represented as mean $\pm$ SEM across trials.

https://doi.org/10.7554/eLife.12215.014

We compared the information efficiency of different revealing strategies by measuring the relative number of revealings required by them to gain the same amount of information. Participants’ active revealings (Figure 5A–B, red lines) were 2.93 (95% confidence interval [CI], 2.60–3.32) times more efficient than random revealings (Figure 5A, blue lines). As a more stringent control than a random strategy, we also simulated eye movement patterns for three heuristic strategies that reproduced different aspects of the statistics of participants’ actual eye movement patterns but lacked the full closed-loop Bayesian interaction between belief updating and eye-movement selection (see Materials and methods for details). We first used a feed-forward strategy, which retained order-dependence, i.e. the way the statistics of revealing locations chosen by the participants depended on revealing number, but ignored participants’ inferences about the category (Figure 5B, orange). The second heuristic took into account the participant’s belief about the underlying image, but not the revealing number (Figure 5B, purple). The third heuristic was a partial closed-loop strategy, that thus respected both the belief- and order-dependence of revealings, but not the details of previous revealings (Figure 5B, brown line). Notably, this strategy would be optimal for simpler stimuli and tasks, such as visual search for a target, but not for our spatially correlated stimuli and task requiring information integration across multiple locations. Participants’ revealings were 1.81–1.92 times more efficient than these three heuristic strategies.
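The efficiency measure used here, the relative number of revealings needed to reach the same amount of information, can be sketched as below. The sketch assumes monotonically increasing cumulative information curves indexed by revealing number, linear interpolation between revealings, and averaging of the ratio over a set of target information levels; the exact estimation procedure is given in Materials and methods, and all names and the toy curves are hypothetical.

```python
import numpy as np

def revealings_needed(info_curve, target):
    """Fractional revealing number at which a cumulative information curve
    (entry k is the information after k + 1 revealings) first reaches target."""
    info_curve = np.asarray(info_curve, dtype=float)
    revealing_number = np.arange(1, info_curve.size + 1)
    # np.interp requires increasing x-values, hence the monotonicity assumption.
    return float(np.interp(target, info_curve, revealing_number))

def relative_efficiency(curve_a, curve_b, targets):
    """How many times fewer revealings strategy A needs than strategy B,
    averaged over target information levels reached by both curves."""
    ratios = [
        revealings_needed(curve_b, t) / revealings_needed(curve_a, t)
        for t in targets
    ]
    return float(np.mean(ratios))

# Toy curves: strategy A gains information twice as fast as strategy B.
n = np.arange(1, 26)
toy_a, toy_b = 0.02 * n, 0.01 * n
toy_efficiency = relative_efficiency(toy_a, toy_b, targets=[0.05, 0.10, 0.20])
```

Comparing strategies at matched information levels, rather than at matched revealing numbers, keeps the ratio interpretable even when the curves saturate at different rates.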

However, participants’ active revealings were less efficient by a factor of 2.48 (95% CI, 2.33–2.62) than revealings generated by the BAS algorithm with ‘idealized’ parameters: minimal perception noise (2–3 times lower than our participants’; see Materials and methods) and no prior biases or saccadic inaccuracies (Figure 5A, black lines). Indeed, this efficiency in information gathering was also reflected in our participants’ performance when they viewed revealings generated by this ideal BAS (Figure 2C, black points). In this condition, participants only rescanned the revealed locations for a short amount of time after the final revealing (average of 4.4 s), less than in the active and passive random conditions; thus their increased efficiency could not be attributed to longer rescanning times. The discrepancies between participants’ and BAS’s revealings may be caused by participants employing an inefficient strategy to select their fixation locations, or, alternatively, they may be due to more trivial factors upstream or downstream of the process responsible for selecting the next fixation, such as noise and variability in perception or execution, respectively. Importantly, when we computed the information gain provided by BAS when operating with participants’ prior biases, perceptual noise and the typical saccadic inaccuracies described above, the discrepancy between the informativeness of BAS-generated revealings and that of participants’ revealings was markedly reduced (Figure 5A, green lines, BAS/free-scan efficiency = 1.45, 95% CI, 1.37–1.53). This suggests that the central component of choosing where to fixate was around 70% efficient in our participants (1/1.45 ≈ 0.69), and that a large component of suboptimality arose from other processes.
To examine the role of learning in participants’ active sensing strategy, we computed the relative efficiency of each of the free-scan sessions of our experiment, spanning multiple days, compared to the final session (Figure 5—figure supplement 1). We found that efficiency remained stable over the whole course of the experiment, suggesting there was minimal, if any, learning required in the free-scan task.

Discussion

Our results show that humans employ an active sensing strategy in deciding where to look in a high-level pattern categorization task. In our task, participants’ patterns of eye movements were well predicted by a Bayes-optimal algorithm seeking to maximize information gain relevant to the task with each individual eye movement. This Bayes-optimal strategy involved finding the location in the scene that, when fixated, was most likely to lead to the greatest reduction in categorization error, rather than simply the location associated with the most uncertainty about pixel value.
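
The distinction between seeking task-relevant information and seeking pixel uncertainty can be sketched with a toy Python example. Everything in it is an assumption made for illustration: two categories, binary pixels that are independent across locations, entropy reduction as the measure of information gain, and made-up probabilities. It is not the BAS model used in the study, which operates on spatially correlated stimuli.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def info_gain_score(prior, likelihood):
    """Expected reduction in category entropy from observing one binary pixel.

    prior[c]      : current belief in category c
    likelihood[c] : P(pixel is bright | category c) at the candidate location
    """
    gain = entropy(prior)
    for bright in (True, False):
        joint = [p * (l if bright else 1.0 - l) for p, l in zip(prior, likelihood)]
        p_obs = sum(joint)
        if p_obs > 0:
            posterior = [j / p_obs for j in joint]
            gain -= p_obs * entropy(posterior)
    return gain


def pixel_uncertainty_score(prior, likelihood):
    """Entropy of the predicted pixel value, ignoring what it says about category."""
    p_bright = sum(p * l for p, l in zip(prior, likelihood))
    return entropy([p_bright, 1.0 - p_bright])


# Hypothetical candidate locations (made-up numbers).
prior = [0.5, 0.5]
locations = {
    "A": [0.5, 0.5],  # maximally uncertain pixel, identical under both categories
    "B": [0.9, 0.2],  # slightly more predictable pixel, but category-dependent
}

best_by_info = max(locations, key=lambda x: info_gain_score(prior, locations[x]))
best_by_uncertainty = max(locations, key=lambda x: pixel_uncertainty_score(prior, locations[x]))
print(best_by_info, best_by_uncertainty)
```

In this toy setup the two criteria disagree: location A has the most uncertain pixel but tells the observer nothing about the category, whereas location B is the one an information-maximizing observer would choose.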

The efficiency of active sensing in human vision

Although our participants performed better when revealings were chosen by the BAS algorithm than with their own scan paths, our results suggest that this suboptimality in participants’ performance was in large part due to prior biases, perception noise, and saccadic inaccuracies constraining the selection of fixation locations rather than an inefficient active sensing strategy. In particular, our participants’ eye movements were substantially more efficient than heuristic strategies that only employed a subset of the elements of a fully closed-loop active sensing strategy, and were about 70% as efficient as the optimal active sensor that operated with the participants’ own prior bias, perception noise, and saccadic inaccuracies. Importantly, this estimate does not conflate the inference process with the selection process in that even if the participant’s inference is biased or inconsistent, we can measure, given those beliefs, the extent to which they select fixation locations optimally. The 30% inefficiency we found may be due to unmodeled constraints in human eye movement strategies, such as the biases for cardinal directions or fixating the centre of an image that were apparent in our data (Figure 3A) and that may also be beneficial in natural environments (Tatler and Vincent, 2008; 2009). Such suboptimalities are conceptually different from those that we factored out using the specific biases and noise included in the BAS model. The latter are suboptimalities that arise in the execution of a planned saccade (as in Figure 3—figure supplement 2), while the former could be suboptimalities of the planning itself. Therefore, as we were interested in the degree of (sub)optimality of the planning component, we chose not to factor out potential biases for cardinal directions or locations. Should such suboptimalities turn out to be part of the execution process, then our estimate of 70% would become a lower bound on the efficiency of the planning process.
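
The 70% figure appears to follow from the BAS/free-scan efficiency ratio of 1.45 (95% CI, 1.37–1.53) reported in the Results. The short Python sketch below makes that arithmetic explicit. The function name is hypothetical, and carrying the confidence bounds through the reciprocal is an assumption of this sketch rather than a procedure stated in the text.

```python
def efficiency_from_ratio(ratio):
    """Efficiency of participants relative to a reference strategy, given that
    the reference is `ratio` times more efficient than the participants."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 / ratio


# Reported BAS/free-scan ratio and its 95% CI (from the Results).
point = efficiency_from_ratio(1.45)
# The reciprocal is decreasing, so the upper ratio bound gives the lower efficiency bound.
lower = efficiency_from_ratio(1.53)
upper = efficiency_from_ratio(1.37)

print(f"{point:.2f} ({lower:.2f} to {upper:.2f})")
```

This gives an efficiency of about 0.69, with bounds of roughly 0.65 to 0.73, consistent with the "around 70%" quoted in the text and the corresponding 30% inefficiency.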

Our results contrast with recent studies suggesting that the pattern of eye movements does not follow an active sensing strategy. In one study, using a simple object localization task, it was found that the choice of fixation locations was close to random despite obvious learning of the underlying stimulus statistics (Holm et al., 2012). In another study, using a simplified visual search task, participants’ eye movements were virtually unaffected by the configuration of the visual stimuli presented in the trial (Morvan and Maloney, 2012). In contrast, we found that participants used both their prior knowledge of stimulus statistics and evidence accumulated about the current visual scene to guide their eye movements. We speculate that better performance in our case may be due to our more naturalistic task, which involves the extraction of abstract latent features, revealing more typical processing of the sensorimotor system.

Relevance for natural vision

Although our task was designed to emulate many features of natural vision, there remain several differences. First, our gaze-contingent display involved exposing small patches of the image at each fixation, whereas in natural vision information is available from the full visual field, albeit with decreasing acuity away from the fovea. If participants’ selection process was aware of and actively optimized for the changed field of vision in our task, the resulting eye movement patterns could be different from those in natural vision. Nevertheless, the basic summary statistics of (macro-)saccades in our task (average size and inter-saccadic intervals) were similar to those found in natural vision (Hayhoe et al., 2003; Land and Tatler, 2009), and there was a lack of adaptation to the task over the course of several days of free-scan sessions (Figure 5—figure supplement 1). These observations suggest, at least indirectly, that participants did not depart dramatically from their natural eye movement strategies.

Second, our task focuses on the voluntary component of eye movements, that is, the scan paths of fixations, rather than the more involuntary processes of micro-saccades and drift (Rolfs, 2009; Ko et al., 2010; Rucci et al., 2007; Poletti et al., 2013; Kuang et al., 2012), which did not trigger new revealings. Importantly, in our task these involuntary processes are likely to be used for extracting information within each revealed patch, whereas high-level categorization required macro-saccades. This is because these involuntary processes cover an area with a standard deviation (SD) on the order of 0.22° (over the course of each fixation; Rolfs, 2009), which is similar to our Gaussian apertures (SD of 0.18°), whereas the smallest length scale of our stimuli (0.91°) is about 4 times larger. However, it is possible that in natural scenes with finer structure, micro-saccades and drift may contribute more to abstract feature extraction.

Third, our stimulus set had strictly stationary statistics. This is an approximation to natural image statistics, which are often assumed to be spatially stationary on average (e.g. Field, 1987) but usually include local non-stationarities (e.g. different objects with different characteristic spatial frequencies). Nevertheless, our stimuli were more naturalistic than many of the stimuli used in active sensing studies of visual search (e.g. simple 1/f noise) while still allowing a rigorous control and measurement of the amount of high-level category information available at any potential revealing location, which would not have been possible with natural scenes.

Finally, natural vision works under time constraints. Our task limited the number of revealings in each trial rather than the time, although the temporal aspects of the eye movements (inter-saccadic intervals) were similar to eye movements with natural scenes. We also showed in a control experiment that the additional time allowed for rescanning the revealed locations did not affect the initial scanning strategy.

A challenge for future studies will be to employ visual tasks that are more naturalistic in these respects while still retaining our ability to quantify the task-relevant information available to participants at any point in a trial. It may well be that participants are more efficient in such naturalistic settings than the 70% we found in our task.

Relation to earlier work

Our approach makes several important contributions. First, by using a gaze-contingent display we were able to isolate the top-down strategies of eye movement selection, thereby complementing studies which examined the influence of low-level visual features and salience (Itti and Koch, 2000; Wismeijer and Gegenfurtner, 2012) on eye movement selection. In addition, our task focuses on integrating low-level visual information into an abstract category thus complementing studies that examine the effect of target value (Navalpakkam et al., 2010; Krajbich et al., 2010; Markowitz et al., 2011; Schutz et al., 2012) or more cognitive tasks that use eye movement (Nelson and Cottrell, 2007) or button-pressing for information search (Castro et al., 2009; Gureckis and Markant, 2009; Borji and Itti, 2013). Some of these studies used a similar Bayesian formalism to obtain the active learning strategies specific to their tasks and showed that humans perform active learning when given the opportunity to consciously deliberate from which location in a scene they should gather information next. In contrast, our work shows that high-efficiency active sensing is a natural strategy for eye movements that humans adopt without overt deliberation, as evidenced by the inter-saccadic intervals matching naturalistic tasks and the fact that our participants reached near-asymptotic efficiency already in the first session of our task.

Second, by using a pattern categorization task we could ensure that no single location was especially informative but that the task required an integration of information from multiple locations both to select the next eye movement and to solve the task, and specifically that a closed-loop strategy was necessary for solving the task efficiently. This is in contrast with studies that attempted to quantify the general informativeness of single locations in a scene (Gosselin and Schyns, 2001), and showed that they are the target of fixations humans tend to choose in general (Peterson and Eckstein, 2012; Toscani et al., 2013; Chukoskie et al., 2013), or visual search in which, by the nature of the task, the target location is fully informative by itself (Najemnik and Geisler, 2005). As such, most previous studies have not addressed the important interplay between information gathering and fixation selection characteristic of naturalistic active sensing: that, in general, the most informative location is ever-changing, dependent on the history of fixations one has already performed.

Third, in contrast to active sensing for simple visual search (Najemnik and Geisler, 2005; Navalpakkam et al., 2010; Morvan and Maloney, 2012), our formalism extends the range of active sensing to tasks which have arbitrary, not necessarily spatial, latent features (such as categories). In particular, it provides the first fixation-by-fixation analysis of information gathering under active eye movements by carefully matching our observer model to participants’ performance. As a result, for the first time, we were able to dissociate the contributions of the eye-movement selection process from those of perceptual, motor, and decision processes, identify predominant sources of apparent sub-optimality in active sensing, and quantify the efficiency of choosing each individual fixation throughout scanning a whole scene. Taken together, these features make our approach amenable to multiple tasks in different perceptual domains (Kleinfeld et al., 2006), as well as high-level cognitive tasks such as estimating the age or socio-economic status of people in a scene (Yarbus, 1967).

Materials and methods

Participants

Three naive participants (aged 25–35 years; none of them were authors or neuroscientists) took part in the experiment. All participants were neurologically healthy, had normal or corrected-to-normal vision and gave their informed consent before participating. The study was approved by the institutional ethics committee. Each experiment took approximately 12 hr across 6 days (2 hr per day). As this experiment was particularly laborious, we focus on within-participant analysis.

Active sensing involves an interplay between perception and action.

Image categorization task and participants’ performance.

Density maps of relative revealing locations and their correlations.

Example trial of the Bayesian active sensor (BAS) and its maximum entropy variant.

Information gain as a function of revealing number for different strategies.

Author details

Scott Cheng-Hsin Yang

Máté Lengyel

Daniel M Wolpert
