The view-tolerance of human identity recognition depends on horizontal face information

  1. Psychological Sciences Research Institute (IPSY), UCLouvain, Louvain-la-Neuve, Belgium
  2. Université Clermont-Auvergne, CNRS, LAPSCO, Clermont-Ferrand, France
  3. Statistical Methodology and Computing Service (SMCS), UCLouvain, Louvain-la-Neuve, Belgium
  4. Institute of Neurosciences (IoNS), UCLouvain, Louvain-la-Neuve, Belgium

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Tessa Dekker
    University College London, London, United Kingdom
  • Senior Editor
    Joshua Gold
    University of Pennsylvania, Philadelphia, United States of America

Reviewer #1 (Public review):

Summary:

The authors describe the results of a single study designed to investigate the extent to which horizontal orientation energy plays a key role in supporting view-invariant face recognition. The authors collected behavioral data from adult observers who were asked to complete an old/new face matching task by learning broad-spectrum faces (not orientation filtered) during a familiarization phase and subsequently trying to label filtered faces as previously seen or novel at test. This data revealed a clear bias favoring the use of horizontal orientation energy across viewpoint changes in the target images. The authors then compared different ideal observer models (cross-correlations between target and probe stimuli) to examine how this profile might be reflected in the image-level appearance of their filtered images. This revealed that a model looking for the best matching face within a viewpoint differed substantially from human data, exhibiting a vertical orientation bias for extreme profiles. However, a model forced to match targets to probes at different viewing angles exhibited a consistent horizontal bias in much the same manner as human observers.

Strengths:

I think the question is an important one: The horizontal orientation bias is a great example of a low-level image property being linked to high-level recognition outcomes, and understanding the nature of that connection is important. I found the old/new task to be a straightforward task that was implemented ably and that has the benefit of being simple for participants to carry out and simple to analyze. I particularly appreciated that the authors chose to describe human data via a lower-dimensional model (their Gaussian fits to individual data) for further analysis. This was a nice way to express the nature of the tuning function, favoring horizontal orientation bias in a way that makes key parameters explicit. Broadly speaking, I also thought that the model comparison they include between the view-selective and view-tolerant models was a great next step. This analysis has the potential to reveal some good insights into how this bias emerges and ask fine-grained questions about the parameters in their model fits to the behavioral data.
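As a rough illustration of such a lower-dimensional description, a Gaussian tuning fit to orientation-wise sensitivity might be sketched as follows (Python/SciPy; the data values and parameter names are hypothetical, and a complete fit would also need to respect the 180-degree circularity of orientation):

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_tuning(theta, amplitude, peak, bandwidth, baseline):
    """Gaussian tuning curve over filter orientation (degrees);
    90 deg corresponds to horizontal in this convention."""
    return baseline + amplitude * np.exp(-0.5 * ((theta - peak) / bandwidth) ** 2)

# Hypothetical per-observer data: sensitivity (d') at each filter orientation.
orientations = np.array([0, 30, 60, 90, 120, 150], dtype=float)
dprime = np.array([0.4, 0.7, 1.3, 1.8, 1.2, 0.6])

# Fit amplitude, peak location, bandwidth, and baseline (initial guess at horizontal).
params, _ = curve_fit(gaussian_tuning, orientations, dprime, p0=[1.0, 90.0, 30.0, 0.5])
amplitude, peak, bandwidth, baseline = params
print(f"peak = {peak:.1f} deg, bandwidth = {bandwidth:.1f} deg, amplitude = {amplitude:.2f}")
```

The fitted peak location and bandwidth then make the shape of the tuning function explicit for each observer.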

Weaknesses:

I will start with what I think is the biggest difficulty I had with the paper. Much as I liked the model comparison analysis, I also don't quite know what to make of the view-tolerant model. As I understand the authors' description, the key feature of this model is that it does not get to compare the target and probe at the same yaw angle, but must instead pick a best match from candidates that are at different yaws. While it is interesting to see that this leads to a very different orientation profile, it also isn't obvious to me why such a comparison would be reflective of what the visual system is probably doing. I can see that the view-specific model is more or less assuming something like an exemplar representation of each face: You have the opportunity to compare a new image to a whole library of viewpoints, and presumably it isn't hard to start with some kind of first pass that identifies the best matching view first before trying to identify/match the individual in question. What I don't get about the view-tolerant model is that it seems almost like an anti-exemplar model: You specifically lack the best viewpoint in the library but have to make do with the other options. Again, this is sort of interesting and the very different behavior of the model is neat to discuss, but it doesn't seem easy to align with any theoretical perspective on face recognition. My thinking here is that it might be useful to consider an additional alternate model that doesn't specifically exclude the best-matching viewpoint, but perhaps condenses appearance across views into something like a prototype. I could even see an argument for something like the yaw-averages presented earlier in the manuscript as the basis for such a model, but this might be too much of a stretch. Overall, what I'd like to see is some kind of alternate model that incorporates the existence of the best-match viewpoint somehow, but without the explicit exemplar structure of the view-specific model.

Besides this larger issue, I would also like to see some more details about the nature of the cross-correlation that is the basis for this model comparison. I mostly think I get what is happening, but I think the authors could expand more on the nature of their noise model to make more explicit what is happening before these cross-correlations are taken. I infer that there is a noise-addition step to get them off the ceiling, but I felt that I had to read between the lines a bit to determine this.

Another thing that I think is worth considering and commenting on is the stimuli themselves and the extent to which this may limit the outcomes of their behavioral task. The use of the 3D laser-scanned faces has some obvious advantages, but also (I think) removes the possibility for pigmentation to contribute to recognition, removes the contribution of varying illumination and expression to appearance variability, and perhaps presents observers with more homogeneous faces than one typically has to worry about. I don't think these negate the current results, but I'd like the authors to expand on their discussion of these factors, particularly pigmentation. Naively, surface color and texture seem like they could offer diagnostic cues to identity that don't rely so critically on horizontal orientations, so removing these may mean that horizontal bias is particularly evident when face shape is the critical cue for recognition.

Reviewer #2 (Public review):

This study investigates the visual information that is used for the recognition of faces. This is an important question in vision research and is critical for social interactions more generally. The authors ask whether our ability to recognise faces, across different viewpoints, varies as a function of the orientation information available in the image. Consistent with previous findings from this group and others, they find that horizontally filtered faces were recognised better than vertically filtered faces. Next, they probe the mechanism underlying this pattern of data by designing two model observers. The first was optimised for faces at a specific viewpoint (view-selective). The second was generalised across viewpoints (view-tolerant). In contrast to the human data, the view-specific model shows that the information that is useful for identity judgements varies according to viewpoint. For example, frontal face identities are again optimally discriminated with horizontal orientation information, but profiles are optimally discriminated with more vertical orientation information. These findings show human face recognition is biased toward horizontal orientation information, even though this may be suboptimal for the recognition of profile views of the face.

One issue in the design of this study was the lowering of the signal-to-noise ratio in the view-selective observer. This decision was taken to avoid ceiling effects. However, it is not clear how this affects the similarity with the human observers.

Another issue is the decision to normalise image energy across orientations and viewpoints. I can see the logic in wanting to control for these effects, but this does reflect natural variation in image properties. So, again, I wonder what the results would look like without this step.

Despite the bias toward horizontal orientations in human observers, there were some differences in the orientation preference at each viewpoint. For example, frontal faces were biased to horizontal (90 degrees), but other viewpoints had biases that were slightly off horizontal (e.g., right profile: 80 degrees, left profile: 100 degrees). This does seem to show that differences in statistical information at different viewpoints (more horizontal information for frontal and more vertical information for profile) do influence human perception. It would be good to reflect on this nuance in the data.

Author response:

Reviewer #1 (Public review):

Summary:

The authors describe the results of a single study designed to investigate the extent to which horizontal orientation energy plays a key role in supporting view-invariant face recognition. The authors collected behavioral data from adult observers who were asked to complete an old/new face matching task by learning broad-spectrum faces (not orientation filtered) during a familiarization phase and subsequently trying to label filtered faces as previously seen or novel at test. This data revealed a clear bias favoring the use of horizontal orientation energy across viewpoint changes in the target images. The authors then compared different ideal observer models (cross-correlations between target and probe stimuli) to examine how this profile might be reflected in the image-level appearance of their filtered images. This revealed that a model looking for the best matching face within a viewpoint differed substantially from human data, exhibiting a vertical orientation bias for extreme profiles. However, a model forced to match targets to probes at different viewing angles exhibited a consistent horizontal bias in much the same manner as human observers.

Strengths:

I think the question is an important one: The horizontal orientation bias is a great example of a low-level image property being linked to high-level recognition outcomes, and understanding the nature of that connection is important. I found the old/new task to be a straightforward task that was implemented ably and that has the benefit of being simple for participants to carry out and simple to analyze. I particularly appreciated that the authors chose to describe human data via a lower-dimensional model (their Gaussian fits to individual data) for further analysis. This was a nice way to express the nature of the tuning function, favoring horizontal orientation bias in a way that makes key parameters explicit. Broadly speaking, I also thought that the model comparison they include between the view-selective and view-tolerant models was a great next step. This analysis has the potential to reveal some good insights into how this bias emerges and ask fine-grained questions about the parameters in their model fits to the behavioral data.

We thank the reviewer for their positive appraisal of the importance of our research question and of the soundness of our approach.

Weaknesses:

I will start with what I think is the biggest difficulty I had with the paper. Much as I liked the model comparison analysis, I also don't quite know what to make of the view-tolerant model. As I understand the authors' description, the key feature of this model is that it does not get to compare the target and probe at the same yaw angle, but must instead pick a best match from candidates that are at different yaws. While it is interesting to see that this leads to a very different orientation profile, it also isn't obvious to me why such a comparison would be reflective of what the visual system is probably doing. I can see that the view-specific model is more or less assuming something like an exemplar representation of each face: You have the opportunity to compare a new image to a whole library of viewpoints, and presumably it isn't hard to start with some kind of first pass that identifies the best matching view first before trying to identify/match the individual in question. What I don't get about the view-tolerant model is that it seems almost like an anti-exemplar model: You specifically lack the best viewpoint in the library but have to make do with the other options. Again, this is sort of interesting and the very different behavior of the model is neat to discuss, but it doesn't seem easy to align with any theoretical perspective on face recognition. My thinking here is that it might be useful to consider an additional alternate model that doesn't specifically exclude the best-matching viewpoint, but perhaps condenses appearance across views into something like a prototype. I could even see an argument for something like the yaw-averages presented earlier in the manuscript as the basis for such a model, but this might be too much of a stretch. Overall, what I'd like to see is some kind of alternate model that incorporates the existence of the best-match viewpoint somehow, but without the explicit exemplar structure of the view-specific model.

The view-tolerant model was designed so that identity had to be abstracted away from variations in yaw in order to support face recognition. We believe this model aligns with the notion of tolerant recognition.

The tolerance of identity recognition is presumably supported by an internal representation of the natural statistics of identity, i.e., the stable traits and (idiosyncratic) variability of a face, which builds up through varied encounters with a given face (Burton, Jenkins et al. 2005, Burton, Jenkins and Schweinberger 2011, Jenkins and Burton 2011, Jenkins, White et al. 2011, Burton, Kramer et al. 2016, Menon, Kemp and White 2018).

The average of various images of a face captures its appearance distribution (i.e., variability) and central tendency (i.e., stable properties; Figure 1) and could be used as a reasonable proxy for its natural statistical properties (Burton, Jenkins et al. 2005). We thus believe that the alternate model proposed by the reviewer is relevant to existing theories of face identity recognition, and we agree that our current model observers do not fully capture this aspect. It is therefore an excellent idea to examine the orientation tuning profile of a model observer that compares a specific view of a face to the average of all views of that identity. Since the horizontal range is proposed to carry the view-stable cues to identity, we expect such a ‘viewpoint-average’ model observer to perform best with horizontally filtered faces and its orientation tuning profile to significantly predict human performance across views. We expect the viewpoint-tolerant and viewpoint-average observers to behave similarly, as both reflect the stability of horizontal identity cues across variations in viewpoint.
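As a rough illustration (not the final implementation), such a viewpoint-average observer could be sketched as follows in Python/NumPy, with hypothetical function and variable names, matching each probe view against per-identity templates averaged across all yaw views:

```python
import numpy as np

def image_correlation(a, b):
    """Pearson correlation between two images (flattened)."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

def viewpoint_average_observer(probe, gallery):
    """Match a filtered probe view to per-identity 'viewpoint-average' templates.

    probe   : 2D array, one view of the to-be-recognized face
    gallery : dict mapping identity label -> list of 2D arrays (all yaw views)
    Returns the identity whose across-view average image best matches the probe.
    """
    scores = {}
    for identity, views in gallery.items():
        template = np.mean(np.stack(views), axis=0)  # yaw-average image of this identity
        scores[identity] = image_correlation(probe, template)
    return max(scores, key=scores.get)
```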

Besides this larger issue, I would also like to see some more details about the nature of the cross-correlation that is the basis for this model comparison. I mostly think I get what is happening, but I think the authors could expand more on the nature of their noise model to make more explicit what is happening before these cross-correlations are taken. I infer that there is a noise-addition step to get them off the ceiling, but I felt that I had to read between the lines a bit to determine this.

The view-selective model responded correctly whenever it successfully matched a given face identity at a specific viewpoint to itself. Since there was an exact match in each trial, resulting in uninformative ceiling performance, we decreased the signal-to-noise ratio (SNR) of the target and probe images to .125 (face RMS contrast: .01; noise RMS contrast: .08). In every trial, target and probe faces were each combined with 10 different random noise patterns. The SNR was adjusted so that the overall performance of the view-selective model fell within the range of human performance. We will describe these important aspects in the Methods and add supplementary material with a graphical illustration of the d’ distributions of each model and of human observers.
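As a rough sketch of this noise-addition step (Python/NumPy; the code structure and names are illustrative, and only the contrast values follow the description above):

```python
import numpy as np

FACE_RMS = 0.01    # face RMS contrast
NOISE_RMS = 0.08   # noise RMS contrast, so SNR = 0.01 / 0.08 = 0.125
N_NOISE = 10       # independent noise fields per face per trial

def set_rms_contrast(img, target_rms):
    """Rescale a mean-subtracted image to a target RMS contrast."""
    img = img - img.mean()
    return img * (target_rms / img.std())

def embed_in_noise(face, rng):
    """Combine a low-contrast face with Gaussian white noise of higher contrast."""
    noise = rng.standard_normal(face.shape)
    return set_rms_contrast(face, FACE_RMS) + set_rms_contrast(noise, NOISE_RMS)

rng = np.random.default_rng(0)
face = rng.random((256, 256))  # placeholder for a filtered face image
noisy_versions = [embed_in_noise(face, rng) for _ in range(N_NOISE)]
```

The cross-correlations are then taken on these noisy versions of target and probe, which keeps the view-selective model off the ceiling.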

Another thing that I think is worth considering and commenting on is the stimuli themselves and the extent to which this may limit the outcomes of their behavioral task. The use of the 3D laser-scanned faces has some obvious advantages, but also (I think) removes the possibility for pigmentation to contribute to recognition, removes the contribution of varying illumination and expression to appearance variability, and perhaps presents observers with more homogeneous faces than one typically has to worry about. I don't think these negate the current results, but I'd like the authors to expand on their discussion of these factors, particularly pigmentation. Naively, surface color and texture seem like they could offer diagnostic cues to identity that don't rely so critically on horizontal orientations, so removing these may mean that horizontal bias is particularly evident when face shape is the critical cue for recognition.

We did remove surface color by converting the images to grayscale. While we acknowledge that this conversion removed one potential source of surface information, it is unlikely that our stimuli fully eliminated the contribution of surface pigmentation. Pigmentation refers to all surface reflectance properties (Russell, Sinha et al. 2006), and hue (color) is only one surface cue among others. The grayscale 3D laser-scanned faces used here still contained natural variations in crucial surface cues such as skin albedo (i.e., how light or dark the surface appears) and texture (i.e., spatial variation in how light is reflected). Both color and grayscale stimuli (2D face pictures or 3D laser-scanned faces like ours) have been used to disentangle the roles of shape and surface cues in identity recognition (e.g., Troje and Bulthoff 1996, Vuong, Peissig et al. 2005, Russell, Sinha et al. 2006, Russell, Biederman et al. 2007, Jiang, Dricot et al. 2009).

More fundamentally, we demonstrated that the diagnosticity of the horizontal range of face information is not restricted to the transmission of shape cues. Our recent work has indeed shown that the processing of both face shape and surface most critically relies on horizontal information (Dumont, Roux-Sibilon and Goffaux 2024).

Reviewer #2 (Public review):

This study investigates the visual information that is used for the recognition of faces. This is an important question in vision research and is critical for social interactions more generally. The authors ask whether our ability to recognise faces, across different viewpoints, varies as a function of the orientation information available in the image. Consistent with previous findings from this group and others, they find that horizontally filtered faces were recognised better than vertically filtered faces. Next, they probe the mechanism underlying this pattern of data by designing two model observers. The first was optimised for faces at a specific viewpoint (view-selective). The second was generalised across viewpoints (view-tolerant). In contrast to the human data, the view-specific model shows that the information that is useful for identity judgements varies according to viewpoint. For example, frontal face identities are again optimally discriminated with horizontal orientation information, but profiles are optimally discriminated with more vertical orientation information. These findings show human face recognition is biased toward horizontal orientation information, even though this may be suboptimal for the recognition of profile views of the face.

One issue in the design of this study was the lowering of the signal-to-noise ratio in the view-selective observer. This decision was taken to avoid ceiling effects. However, it is not clear how this affects the similarity with the human observers.

The view-selective model responded correctly whenever it successfully matched a given face identity at a specific viewpoint to itself. Since there was an exact match in each trial, resulting in uninformative ceiling performance, we decreased the signal-to-noise ratio (SNR) of the target and probe images to .125 (face RMS contrast: .01; noise RMS contrast: .08). In every trial, target and probe faces were each combined with 10 different random noise patterns. The SNR was adjusted so that the overall performance of the view-selective model fell within the range of human performance. We will describe these important aspects in the Methods and add supplementary material with a graphical illustration of the d’ distributions of each model and of human observers (see the noise-addition sketch in our response to Reviewer #1 above).

Another issue is the decision to normalise image energy across orientations and viewpoints. I can see the logic in wanting to control for these effects, but this does reflect natural variation in image properties. So, again, I wonder what the results would look like without this step.

The energy of natural images is unevenly distributed across orientations (e.g., Hansen, Essock et al. 2003). Images of faces cropped from their background, as used here, contain most of their energy in the horizontal range (Keil 2009, Goffaux and Greenwood 2016, Goffaux 2019). If energy were not normalized after orientation filtering, this uneven distribution would boost recognition performance in the horizontal range across views. Normalization was performed across our experimental conditions merely to prevent energy differences from explaining the influence of viewpoint on the orientation tuning profile.
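A rough sketch of this kind of post-filtering energy normalization (Python/NumPy; illustrative, with hypothetical function names, not the actual stimulus-generation code):

```python
import numpy as np

def normalize_energy(filtered_img, target_rms=0.1, mask=None):
    """Equate the RMS contrast (energy) of an orientation-filtered face across
    conditions; `mask` optionally restricts the computation to face pixels
    (excluding the background)."""
    img = filtered_img.astype(float)
    pixels = img[mask] if mask is not None else img.ravel()
    centered = img - pixels.mean()
    rms = (centered[mask] if mask is not None else centered).std()
    return centered * (target_rms / rms)
```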

We are not aware of any systematic natural variations of energy across face views. To address this, we measured the average face energy (i.e., RMS contrast) in the original stimulus set, before the application of any image processing or manipulation. Background pixels were excluded from these analyses. Across yaws, energy ranged between .11 and .14 on a 0-to-1 grayscale. This is moderate compared to the range of energy variation we measured across identities (from .08 to .18), suggesting that variations in energy across viewpoints are modest relative to variations related to identity. It is unclear whether these observations are specific to our stimulus set or whether they generalize to the faces we encounter in everyday life. They nevertheless indicate that RMS contrast did not vary substantially across views in the present study and suggest that RMS normalization is unlikely to have affected the influence of viewpoint on recognition performance.

Nonetheless, we acknowledge the importance of this issue regarding the trade-off between experimental control and stimulus naturalness, and we will refer to it explicitly in the methods section.

Despite the bias toward horizontal orientations in human observers, there were some differences in the orientation preference at each viewpoint. For example, frontal faces were biased to horizontal (90 degrees), but other viewpoints had biases that were slightly off horizontal (e.g., right profile: 80 degrees, left profile: 100 degrees). This does seem to show that differences in statistical information at different viewpoints (more horizontal information for frontal and more vertical information for profile) do influence human perception. It would be good to reflect on this nuance in the data.

Indeed, the human performance data indicate that while identity recognition remains tuned to horizontal information, horizontal tuning shows some variation across viewpoints. We primarily focused on the first aspect because of its direct relevance to our research objective, but we also discussed the second: with yaw rotation, certain non-horizontal morphological features such as the jawline or nose bridge may increasingly contribute to identity recognition, whereas at frontal or near-frontal views, features are mostly horizontally oriented (e.g., Keil 2008, Keil 2009). We will relate this part of the discussion more explicitly to the observed fluctuation of the peak location as a function of yaw.
