Feature-based and property-based visual tasks

(A) Feature-based visual tasks. Most visual tasks involve making decisions based on specific features of the image; face recognition is shown here as an example. According to standard theories of decision making, such tasks are solved in the brain by setting up a decision variable in a multidimensional feature space (arrow) and making decisions based on whether the value of the decision variable is larger or smaller than a decision boundary (dashed line).

(B) Property-based visual tasks. By contrast, some tasks involve detecting properties of the image, such as a same-different task (illustrated using faces; top row), detecting an oddball item (middle row), or judging whether an object is symmetric (bottom row). These tasks cannot be solved by looking for any specific feature. As a result, they do not fit into standard models of decision making, since the underlying feature space and decision variable are unknown.

Solving oddball search and symmetry tasks using visual homogeneity

(A) Example target-present search display, containing a single oddball target (horse) among identical distractors (dog). Participants in such tasks have to indicate whether the display contains an oddball or not, without knowing the features of the target or distractor. This means they have to perform this task by detecting some property of each display rather than some feature contained in it.

(B) Example target-absent search display containing no oddball target.

(C) Hypothesized neural computation for target present/absent judgements. According to multiple object normalization, the neural response to multiple items is the average of the responses to the individual items. Thus, the response to a target-absent array will be identical to the response to its constituent item, whereas the response to a target-present array will lie along the line joining the corresponding target-absent arrays. This causes target-absent arrays to stay apart (red lines) and target-present arrays to come closer together due to mixing (blue lines). If we calculate, for each display, its distance from a fixed center in this space (a quantity we denote VH, for visual homogeneity), then target-absent arrays will have a larger distance (VHa) than target-present arrays (VHp), and this distance can be used to distinguish between them. Inset: schematic distance from center for target-absent arrays (red) and target-present arrays (blue). Note that this approach may only reflect the initial target selection process in oddball visual search and does not capture all forms of visual search.
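
To make this computation concrete, here is a minimal sketch in Python (with simulated response vectors; all names are ours, and the center is simply placed at the midpoint rather than optimized as in the Methods):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated population responses to two single items (e.g., dog and horse).
r_dog = rng.random(50)
r_horse = rng.random(50)

# Multiple object normalization: the response to an array is the average
# of the responses to its constituent items.
r_absent_dog = r_dog                    # array of identical dogs
r_absent_horse = r_horse                # array of identical horses
r_present = 0.5 * (r_dog + r_horse)     # horse oddball among dogs

# Visual homogeneity = distance from a center in response space.
center = 0.5 * (r_dog + r_horse)        # placeholder center (not optimized)
vh = lambda r: np.linalg.norm(r - center)

# Mixing pulls target-present arrays toward the center, so VHp < VHa.
print(vh(r_absent_dog), vh(r_absent_horse), vh(r_present))
```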

(D) Example asymmetric object in a symmetry detection task. Here too, participants have to indicate whether the display contains a symmetric object or not, without knowing the features of the object itself. This means they must perform the task by detecting some property of the display.

(E) Example symmetric object in a symmetry detection task.

(F) Hypothesized neural computations for symmetry detection. Following multiple object normalization, the response to an object containing repeated parts is equal to the response to the individual part, whereas the response to an object containing two different parts will lie along the line joining the objects with each part repeated. This causes symmetric objects to stand apart (red lines) and asymmetric objects to come closer together due to mixing (blue lines). Thus, visual homogeneity will be larger for symmetric objects (VHs) than for asymmetric objects (VHa). Inset: schematic distance from center for symmetric objects (red) and asymmetric objects (blue).

(G) Behavioral predictions for VH. If visual homogeneity (VH) is a decision variable in visual search and symmetry detection tasks, then response times (RT) must be largest for displays with VH close to the decision boundary. This predicts opposite correlations between RT and VH on either side of the boundary (i.e., for present vs. absent, or symmetric vs. asymmetric judgements), and consequently zero overall correlation between VH and RT.

(H) Neural predictions for VH. Left: correlation between brain activations and VH for two hypothetical brain regions. In the VH-encoding region, brain activations should be positively correlated with VH. In any region that encodes task difficulty as indexed by response time, brain activity should show no correlation, since VH itself is uncorrelated with RT (see panel G). Right: correlation between brain activations and RT. Since VH is uncorrelated with RT overall, the VH-encoding region should show little or no correlation, whereas regions encoding task difficulty should show a positive correlation.

Visual homogeneity predicts target present/absent responses

(A) Example search array in an oddball search task (Experiment 1). Participants viewed an array containing identical items except for an oddball present on either the left or right side, and had to indicate using a key press on which side the oddball appeared. The reciprocal of the average search time was taken as the perceptual distance between the target and distractor items. We measured all possible pairwise distances for 32 grayscale natural objects in this manner.

(B) Perceptual space reconstructed using multidimensional scaling performed on the pairwise perceptual dissimilarities. In the resulting plot, nearby objects represent hard searches and far-away objects represent easy searches. Some images are shown at a small size due to space constraints; in the actual experiment, all objects were equated to have the same longer dimension. The correlation at the top right indicates the match between the distances in the 2D plot and the observed pairwise distances (**** is p < 0.00005).
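
As an illustration of this reconstruction, here is a sketch using scikit-learn's MDS with precomputed dissimilarities (the data below are random placeholders, and the paper's exact implementation may differ):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Simulated 32 x 32 matrix of mean oddball search times (s); in the actual
# analysis these come from Experiment 1.
rt = rng.uniform(0.5, 2.0, size=(32, 32))
rt = 0.5 * (rt + rt.T)                  # search times are symmetric across pairs

dissim = 1.0 / rt                       # perceptual distance = 1 / search time
np.fill_diagonal(dissim, 0.0)

# Embed the 32 objects in 2D so that inter-point distances approximate
# the measured perceptual distances.
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(dissim)
```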

(C) Example display from Experiment 2. Participants performed this task inside the scanner. On each trial, they had to indicate using a key press whether an oddball target was present or absent.

(D) Predicted responses to target-present and target-absent arrays, using the principle that the neural response to multiple items is the average of the individual item responses. This predicts that target-present arrays become similar to each other due to mixing of responses, whereas target-absent arrays stand apart. Consequently, the two types of displays can be distinguished using their distance from a central point in this space. We define this distance as visual homogeneity; it is obtained by finding the optimum center that maximizes the difference in correlations with response times (see Methods).
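
A minimal sketch of this center optimization (our own formulation of the stated objective, not necessarily the exact one in the Methods; all names and the simulated data are ours):

```python
import numpy as np
from scipy.optimize import minimize

def fit_vh_center(coords_present, coords_absent, rt_present, rt_absent):
    """Find the center whose distances (VH) best separate the two conditions.

    RTs are longest near the decision boundary, so VH should correlate
    positively with RT for target-present displays (low VH) and negatively
    for target-absent displays (high VH); we maximize that difference.
    """
    def objective(center):
        vh_p = np.linalg.norm(coords_present - center, axis=1)
        vh_a = np.linalg.norm(coords_absent - center, axis=1)
        r_p = np.corrcoef(vh_p, rt_present)[0, 1]
        r_a = np.corrcoef(vh_a, rt_absent)[0, 1]
        return -(r_p - r_a)             # maximize the correlation difference
    x0 = coords_absent.mean(axis=0)     # start at the centroid
    return minimize(objective, x0, method='Nelder-Mead').x

# Usage with simulated 2D coordinates (illustration only):
rng = np.random.default_rng(0)
coords_a = rng.normal(size=(32, 2))            # single objects = absent arrays
pairs = rng.integers(0, 32, size=(32, 2))      # (target, distractor) per array
coords_p = 0.5 * (coords_a[pairs[:, 0]] + coords_a[pairs[:, 1]])
center = fit_vh_center(coords_p, coords_a,
                       rng.uniform(1, 3, 32), rng.uniform(1, 3, 32))
```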

(E) Mean visual homogeneity relative to the optimum center for target-present and target-absent displays. Error bars represent s.e.m. across all displays. Asterisks represent statistical significance (**** is p < 0.00005, unpaired rank-sum test comparing visual homogeneity for 32 target-absent and 32 target-present arrays).

(F) Response time for target-present searches in Experiment 2 plotted against visual homogeneity calculated from Experiment 1. Asterisks represent statistical significance of the correlation (**** is p < 0.00005). Note that a single model is fit to find the optimum center in representational space that predicts the response times for both target-present and target-absent searches.

(G) Response time for target-absent searches in Experiment 2 plotted against visual homogeneity calculated from Experiment 1. Asterisks represent statistical significance of the correlation (**** is p < 0.00005).

A localized brain region encodes visual homogeneity

(A) Searchlight map showing the correlation between mean activation in each 3×3×3 voxel neighborhood and visual homogeneity.

(B) Searchlight map showing the correlation between neural dissimilarity in each 3×3×3 voxel neighborhood and perceptual dissimilarity measured in visual search.

(C) Key visual regions identified using standard anatomical masks: early visual cortex (EVC), area V4, and the lateral occipital (LO) region. The visual homogeneity (VH) region was identified using the searchlight map in panel A.

(D) Correlation between mean activation and visual homogeneity in the key visual regions EVC, V4, LO, and VH. Error bars represent the standard deviation of the correlation obtained using a bootstrap process, by repeatedly sampling participants with replacement 10,000 times. Asterisks represent statistical significance, estimated by calculating the fraction of bootstrap samples in which the observed trend was violated (* is p < 0.05, ** is p < 0.01, **** is p < 0.0001).
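
A sketch of this bootstrap test (one simple formulation, assuming a vector of per-participant correlations; the paper's exact resampling pipeline may differ):

```python
import numpy as np

def bootstrap_trend(per_subject_r, n_boot=10_000, seed=0):
    """Bootstrap the mean correlation across participants.

    per_subject_r: shape (n_subjects,), each participant's correlation
    between mean activation and visual homogeneity in a region.
    Returns the bootstrap standard deviation (error bar) and the fraction
    of samples violating the observed positive trend (p-value).
    """
    rng = np.random.default_rng(seed)
    n = len(per_subject_r)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample participants
    boot_means = per_subject_r[idx].mean(axis=1)
    return boot_means.std(), np.mean(boot_means < 0)
```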

(E) Correlation between neural dissimilarity in the key visual regions and perceptual dissimilarity. Error bars represent the standard deviation of the correlation obtained using a bootstrap process, by repeatedly sampling participants with replacement 10,000 times. Asterisks represent statistical significance, estimated by calculating the fraction of bootstrap samples in which the observed trend was violated (** is p < 0.001).

Visual homogeneity predicts symmetry perception

(A) Example search array in Experiment 3. Participants viewed an array containing identical items except for an oddball present on either the left or right side, and had to indicate using a key press on which side the oddball appeared. The reciprocal of the average search time was taken as the perceptual distance between the target and distractor items. We measured all possible pairwise distances for 64 objects (32 symmetric, 32 asymmetric) in this manner.

(B) Perceptual space reconstructed using multidimensional scaling performed on the pairwise perceptual dissimilarities. In the resulting plot, nearby objects represent hard searches and far-away objects represent easy searches. Some images are shown at a small size due to space constraints; in the actual experiment, all objects were equated to have the same longer dimension. The correlation at the top right indicates the match between the distances in the 2D plot and the observed pairwise distances (**** is p < 0.00005).

(C) Two example displays from Experiment 4. Participants had to indicate whether the object was symmetric or asymmetric using a key press.

(D) Using the perceptual representations of symmetric and asymmetric objects from Experiment 3, we reasoned that the two types of objects can be distinguished using their distance from a center in perceptual space. The coordinates of this center were optimized to maximize the match to the observed symmetry detection times.

(E) Visual homogeneity relative to the optimum center for asymmetric and symmetric objects. Error bars represent s.e.m. across images. Asterisks represent statistical significance (* is p < 0.05, unpaired rank-sum test comparing visual homogeneity for 32 symmetric and 32 asymmetric objects).

(F) Response time for asymmetric objects in Experiment 4 plotted against visual homogeneity calculated from Experiment 3. Asterisks represent statistical significance of the correlation (** is p < 0.001).

(G) Response time for symmetric objects in Experiment 4 plotted against visual homogeneity calculated from Experiment 3. Asterisks represent statistical significance of the correlation (* is p < 0.05).

Brain region encoding visual homogeneity during symmetry detection

(A) Searchlight map showing the correlation between mean activation in each 3×3×3 voxel neighborhood and visual homogeneity.

(B) Searchlight map showing the correlation between neural dissimilarity in each 3×3×3 voxel neighborhood and perceptual dissimilarity measured in visual search.

(C) Key visual regions identified using standard anatomical masks: early visual cortex (EVC), area V4, and the lateral occipital (LO) region. The visual homogeneity (VH) region was identified using the searchlight map in panel A.

(D) Correlation between mean activation and visual homogeneity in the key visual regions EVC, V4, LO, and VH. Error bars represent the standard deviation of the correlation obtained using a bootstrap process, by repeatedly sampling participants with replacement 10,000 times. Asterisks represent statistical significance, estimated by calculating the fraction of bootstrap samples in which the observed trend was violated (* is p < 0.05, ** is p < 0.01, **** is p < 0.0001).

(E) Correlation between neural dissimilarity in the key visual regions and perceptual dissimilarity. Error bars represent the standard deviation of the correlation obtained using a bootstrap process, by repeatedly sampling participants with replacement 10,000 times. Asterisks represent statistical significance, estimated by calculating the fraction of bootstrap samples in which the observed trend was violated (** is p < 0.001).

Visual homogeneity in deep networks predicts oddball search

(A) Correlation between perceptual dissimilarities and deep network dissimilarities across 32 oddball search experiments, shown for each layer of ResNet-50 (median ± s.e.m. across experiments). The layer with the highest median correlation (r = 0.46) is marked with a blue dashed line; this layer was used for all further analyses.
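
A sketch of how such layer-wise dissimilarities can be obtained (using torchvision's pretrained ResNet-50 and its feature-extraction utility; the node name 'layer3' is only an example and does not correspond to the paper's layer-134 indexing, and image preprocessing is omitted):

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# Pretrained ResNet-50; pull activations from one intermediate layer.
model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
extractor = create_feature_extractor(model, return_nodes={'layer3': 'feat'})

def layer_features(images):
    """images: tensor of shape (n, 3, 224, 224), already preprocessed."""
    with torch.no_grad():
        feats = extractor(images)['feat']
    return feats.flatten(start_dim=1)   # one feature vector per image

# Pairwise dissimilarities between object representations in this layer.
feats = layer_features(torch.rand(32, 3, 224, 224))
dissim = torch.cdist(feats, feats)      # 32 x 32 distance matrix
```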

(B) Predicted visual homogeneity (calculated from ResNet-50 layer 134) for target-present and target-absent searches. Error bars represent s.e.m. across all displays. Asterisks represent statistical significance (**** is p < 0.00005, unpaired rank-sum test comparing visual homogeneity for 32 target-absent and 32 target-present arrays).

(C) Observed response time for target-present searches in Experiment 2 plotted against visual homogeneity calculated from ResNet-50 layer 134. Asterisks represent statistical significance of the correlation (**** is p < 0.00005). Note that a single model is fit to find the optimum center that predicts the response times for both target-present and target-absent searches.

(D) Same as (C) but for target-absent searches.

Additional analysis for Experiment 1

(E) Reaction times of target-absent searches (Experiment 2) plotted against the average dissimilarity to all other objects (Experiment 1).

(F) Visual homogeneity for each object plotted against the average distance of each object to all other objects, suggesting that visual homogeneity is closely related to the average distance of an object to all others.

(G) Correlation between predicted and observed target-present RT as a function of the weight of the target relative to the distractors in the search array. The analysis in the main text assumes that the response to a target-present array is the average of the target and distractor responses. To validate this assumption, we repeated the analysis taking the target-present array response to be r_array = w·r_target + (1 − w)·r_distractor, where r_array is the response to the target-present array, r_target and r_distractor are the responses to the target and distractor, and w is the weight of the target relative to the distractor. If w = 0, the target does not contribute to the overall response; if w = 1, the target dominates the overall response. In this plot, for each value of w, we optimized the coordinates of the center to best match the data, and plotted the correlation between predicted and observed target-present RT. Roughly equal weighting of the target and distractor (w = 0.52) yields the best fit to the data. The gray bar represents the range of weights for which the correlation is statistically significant (p < 0.05).
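
A sketch of this weight sweep (reusing the hypothetical fit_vh_center helper sketched earlier, with simulated placeholder data):

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.normal(size=(32, 2))          # single-object coordinates (from MDS)
pairs = rng.integers(0, 32, size=(32, 2))  # (target, distractor) per array
rt_present = rng.uniform(1, 3, 32)
rt_absent = rng.uniform(1, 3, 32)

weights = np.linspace(0, 1, 51)
fits = []
for w in weights:
    # Model each target-present array as a weighted mix of its target and
    # distractor responses, and re-optimize the center for this w.
    coords_p = w * coords[pairs[:, 0]] + (1 - w) * coords[pairs[:, 1]]
    center = fit_vh_center(coords_p, coords, rt_present, rt_absent)
    vh_p = np.linalg.norm(coords_p - center, axis=1)
    fits.append(np.corrcoef(vh_p, rt_present)[0, 1])

best_w = weights[np.argmax(fits)]          # the reported best fit was at w = 0.52
```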

Generalization to other objects

(A) Response times for target present/absent judgements in Experiment S2 (involving a larger set of natural objects) plotted against visual homogeneity calculated from Experiment S1.

(B) Response times for target present/absent judgements in Experiment S4 (involving silhouettes) plotted against visual homogeneity calculated from Experiment S3.

Target absent/present responses predict same/different responses

(A) Example target-absent trial from the visual search task (Experiment 2).

(B) Example target-present trial from the visual search task (Experiment 2).

(C) Example “same” trial from the same-different task (Experiment S5), matched exactly to the target-absent trial in panel A.

(D) Example “different” trial from the same-different task (Experiment S5), matched exactly to the target-present trial in panel B.

(E) Response time on “same” trials in Experiment S5 plotted against response time for the corresponding target-absent trials from Experiment 2.

(F) Response time on “different” trials in Experiment S5 plotted against response time for the corresponding target-present trials from Experiment 2.

Target-absent responses are unaffected by mixing disparate searches

(A) Response times in the Mixed Block plotted against the corresponding response times in the Animal Block for present searches (blue) and absent searches (red).

(B) Response times in the Mixed Block plotted against the corresponding response times in the Silhouette-only Block, with conventions as in panel A.

Target-absent responses are unaffected by disparate object context

(A) 2D embedding of the 49 silhouettes based on the pairwise dissimilarities (1/RT) measured using an oddball visual search experiment (Experiment S3). Shapes are colored according to the set to which they belong: red for shapes common to Sets 1 and 2, blue for shapes only in Set 1, and green for shapes only in Set 2.

(B) Average dissimilarity of the common items relative to the items in Set 2 plotted against their average dissimilarity relative to the items in Set 1. If visual homogeneity depends on the average distance to the other objects in the immediate experimental context, then target-absent responses for the common objects should be uncorrelated between a block containing Set 1 items and a block containing Set 2 items.

(C) Target-absent search response times for the common items in Block 2 (containing Set 2 items) plotted against the corresponding search times in Block 1 (containing Set 1 items). The strong and significant correlation indicates that target-absent search times are independent of the immediate experimental context.

Brain activations for target-present and target-absent searches

Whole brain colormap of activation difference between target-present and target-absent searches. The color at each voxel represents the t-statistic computed between the participant-wise mean activations for target-present minus target-absent searches (averaged across searches of each type, and across a 3×3×3 voxel neighborhood centered around that voxel).
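
One way to compute such a map (a sketch assuming per-participant activation volumes and a paired t-test across participants; the 3×3×3 averaging is done here with a uniform filter):

```python
import numpy as np
from scipy import ndimage, stats

def searchlight_tmap(act_present, act_absent):
    """act_present, act_absent: arrays of shape (n_subjects, x, y, z) holding
    each participant's mean activation for each search type."""
    # Average within each 3x3x3 voxel neighborhood for every participant.
    smooth = lambda vol: ndimage.uniform_filter(vol, size=3)
    a_p = np.stack([smooth(v) for v in act_present])
    a_a = np.stack([smooth(v) for v in act_absent])
    # Paired t-statistic across participants at each voxel
    # (target-present minus target-absent).
    t, _ = stats.ttest_rel(a_p, a_a, axis=0)
    return t
```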

Robustness of VH region in target present/absent search

(A) Searchlight map showing the correlation between visual homogeneity and mean activation of an example subject.

(B) Colormap representing the number of subjects for which a particular voxel belonged to the localized VH region.

(C) Colormap of the correlation between visual homogeneity and mean activation across eight subjects in Group 1.

(D) VH region obtained by thresholding the searchlight map in panel C.

(E) Correlation between visual homogeneity and mean activation for participants in Group 1, for the VH region identified from Group 1 and for the VH region identified from Group 2. Asterisks indicate statistical significance of each correlation, obtained by sampling participants with replacement 10,000 times and calculating the fraction of times the correlation was below zero (**** is p < 0.0005). The significant correlation in Group 1 for the region identified using Group 2 suggests that the VH region is consistently localized across subjects.

(F) Searchlight map similar to panel C, but for participants in Group 2.

(G) VH region obtained by thresholding the searchlight map in panel F.

(H) Correlation between visual homogeneity and mean activation for participants in Group 2, for the VH regions identified from Group 1 and from Group 2. Asterisks indicate statistical significance of each correlation, obtained by sampling participants with replacement 1,000 times and calculating the fraction of times the correlation was below zero (*** is p < 0.005). The significant correlation in Group 2 for the region identified using Group 1 suggests that the VH region is consistently localized across subjects.

Searchlight maps with response times

(A) Colormap of correlation between mean activation and response times for target-present search arrays.

(B) Colormap of correlation between mean activation and response times for target-absent search arrays.

(C) Correlation between mean activation and response times for both target-present and target-absent search arrays. To prevent image-wise correlations from being confounded by overall differences in activation level, we z-scored the mean activations for each voxel within each search type (present/absent) before combining them. Likewise, we z-scored the response times within each search type.
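
A sketch of this normalization (with simulated per-display activations and response times; variable names are ours):

```python
import numpy as np
from scipy.stats import zscore

rng = np.random.default_rng(0)
act_present, act_absent = rng.normal(size=32), rng.normal(size=32)
rt_present, rt_absent = rng.uniform(1, 3, 32), rng.uniform(1, 3, 32)

# Z-score within each search type so overall activation and RT differences
# between conditions do not drive the image-wise correlation.
act_all = np.concatenate([zscore(act_present), zscore(act_absent)])
rt_all = np.concatenate([zscore(rt_present), zscore(rt_absent)])
r = np.corrcoef(act_all, rt_all)[0, 1]
```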

(D) Correlation between mean activation and all response times for key visual regions. Error bars represent the standard deviation of the correlation obtained using a bootstrap process, by repeatedly sampling participants with replacement 10,000 times. Asterisks represent statistical significance, estimated by calculating the fraction of bootstrap samples in which the observed trend was violated (* is p < 0.05, ** is p < 0.01, **** is p < 0.0001).

Relative weights of target and distractor in target-present arrays

(A) Colormap of the correlation between observed and predicted voxel activity for the linear voxel model, in which the response to each target-present search array is modelled as a linear combination of the target and distractor activities (taken from the responses to target-absent arrays).
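
A sketch of this per-voxel linear model (an ordinary least-squares fit; function and variable names are ours):

```python
import numpy as np

def voxel_weights(present, target, distractor):
    """Fit present ≈ w_t * target + w_d * distractor for one voxel.

    present, target, distractor: arrays of shape (n_arrays,) holding this
    voxel's response to each target-present array and to the target-absent
    arrays of the corresponding target and distractor.
    """
    X = np.column_stack([target, distractor])
    (w_t, w_d), *_ = np.linalg.lstsq(X, present, rcond=None)
    return w_t, w_d
```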

(B) Region showing good model prediction, obtained by thresholding the colormap in (A).

(C) Target and distractor model coefficients in this region. Each point corresponds to model coefficients derived from a single voxel. The target and distractor coefficients did not differ significantly (p = 0.33, sign-rank test across the weights of the 222 voxels in this region), consistent with roughly equal weighting.

Brain activations for asymmetric and symmetric objects

Whole brain colormap of activation difference between asymmetric and symmetric objects during the symmetry task. The color at each voxel represents the t-statistic computed between the participant-wise mean activations for asymmetric minus symmetric objects (averaged across objects of each type, and across a 3×3×3 voxel neighborhood centered around that voxel).

Robustness of VH region in symmetry detection

(A) Searchlight map showing the correlation between visual homogeneity and mean activation of an example subject.

(B) Colormap representing the number of subjects for which a particular voxel belonged to the localized VH region.

(C) Colormap of the correlation between visual homogeneity and mean activation across eight subjects in Group 1.

(D) VH region obtained by thresholding the searchlight map in panel C.

(E) Correlation between visual homogeneity and mean activation for participants in Group 1, for the VH region identified from Group 1 and for the VH region identified from Group 2. Asterisks indicate statistical significance of each correlation, obtained by sampling participants with replacement 1,000 times and calculating the fraction of times the correlation was below zero (*** is p < 0.005). The significant correlation in Group 1 for the region identified using Group 2 suggests that the VH region is consistently localized across subjects.

(F) Searchlight map similar to panel C, but for 7 participants in Group 2.

(G) VH region obtained by thresholding the searchlight map in panel F.

(H) Correlation between visual homogeneity and mean activation for participants in Group 2, for the VH regions identified from Group 1 and from Group 2. Asterisks indicate statistical significance of each correlation, obtained by sampling participants with replacement 1,000 times and calculating the fraction of times the correlation was below zero (*** is p < 0.005). The significant correlation in Group 2 for the region identified using Group 1 suggests that the VH region is consistently localized across subjects.

Searchlight maps for response time during symmetry task

(A) Colormap of correlation between mean activation and response times for asymmetric objects.

(B) Colormap of correlation between mean activation and response times for symmetric objects.

(C) Correlation between mean activation and response times across both asymmetric and symmetric objects. To prevent image-wise correlations from being confounded by overall differences in activation level, we z-scored the mean activations for each voxel within each object type (asymmetric/symmetric) before combining them. Likewise, we z-scored the response times within each object type before combining.

(D) Correlation between mean activation and all response times for key visual regions. Asterisks indicate statistical significance, calculated in the same way as in Figure 4D.

Comparing the VH regions from Experiments 2 & 4

Key visual regions identified using standard anatomical masks: early visual cortex (EVC), area V4, lateral occipital (LO) region. The VH region from the present/absent search task (Experiment 2, Figure 4C) is overlaid with the VH region identified from the symmetry task (Experiment 4, Figure 6C).

Target-absent search times predict symmetry detection

(A) Example search array from Experiment S7.

(B) Example display containing a symmetric object from Experiment 4.

(C) Response times for asymmetric objects in Experiment 4 plotted against the inverse of the corresponding target-absent response times from Experiment S7.

(D) Response times for symmetric objects in Experiment 4 plotted against the inverse of the corresponding target-absent response times from Experiment S7.

Visual homogeneity predicts categorization times

(A) Example trial of the animal categorization task. Stimuli were presented for 50 ms followed by a noise mask. Subjects indicated whether the presented image was an animal or not using Y/N key presses.

(B) Average categorization times for animals plotted against visual homogeneity relative to an optimum center for this task, calculated from oddball detection task data (Experiment S1).

(C) Same as panel B but for inanimate objects.

(D) Example trial of the dog categorization task.

(E) Average categorization times for dogs plotted against visual homogeneity relative to an optimum center for this task, calculated from oddball detection task data (Experiment S1).

(F) Same as panel E but for non-dogs.

(G) Example trial of the Labrador categorization task.

(H) Average categorization times for Labradors plotted against visual homogeneity relative to an optimum center for this task, calculated from oddball detection task data (Experiment S1).

(I) Same as panel H but for non-Labradors.

Target-absent search predicts visual homogeneity for each category

(A) Inverse of target-absent search times from Experiment S2 plotted against the optimized visual homogeneity from the animate task.

(B) Inverse of target-absent search times from Experiment S2 but now plotted against the optimized visual homogeneity from the dog task.

(C) Inverse of target-absent search times from Experiment S2 but now plotted against the optimized visual homogeneity from the Labrador task.