Neuroscience

The neurons that mistook a hat for a face

  1. Michael J Arcaro (corresponding author)
  2. Carlos Ponce
  3. Margaret Livingstone
  1. Department of Psychology, University of Pennsylvania, United States
  2. Department of Neuroscience, Washington University in St. Louis, United States
  3. Department of Neurobiology, Harvard Medical School, United States
Research Article
Cite this article as: eLife 2020;9:e53798 doi: 10.7554/eLife.53798

Abstract

Despite evidence that context promotes the visual recognition of objects, decades of research have led to the pervasive notion that the object processing pathway in primate cortex consists of multiple areas that each process the intrinsic features of a few particular categories (e.g. faces, bodies, hands, objects, and scenes). Here we report that such category-selective neurons do not in fact code individual categories in isolation but are also sensitive to object relationships that reflect statistical regularities of the experienced environment. We show by direct neuronal recording that face-selective neurons respond not just to an image of a face, but also to parts of an image where contextual cues—for example a body—indicate a face ought to be, even if what is there is not a face.

Introduction

Our experience with the visual world guides how we understand it. The regularity of our experience of objects within environments and with each other provides a rich context that influences our perception and recognition of objects and categories (Bar, 2004; Greene, 2013). We do not often see faces in isolation, rather our experience of faces is embedded within a contextually rich environment – a face is usually conjoined to a body – and this regular structure influences how we perceive and interact with faces (Bar and Ullman, 1996; Meeren et al., 2005; Rice et al., 2013). However, it is generally found that neurons critical for the perception of faces (Afraz et al., 2015; Afraz et al., 2006; Moeller et al., 2017) do not respond to bodies, and vice versa (Freiwald and Tsao, 2010; Freiwald et al., 2009; Kiani et al., 2007; Tsao et al., 2006), suggesting that such neurons code for the intrinsic features of either faces or bodies but not both (Chang and Tsao, 2017; Issa and DiCarlo, 2012; Leopold et al., 2006). These studies typically probe neural function by presenting objects in isolation, independent of their typical context.

Here, we reconcile this apparent discrepancy between behavior and neural coding. We recorded from two face-selective domains within inferotemporal cortex (IT) in two macaque monkeys while they viewed complex natural images. By considering the spatial relationships of objects in a natural scene relative to a neuron’s receptive field, we show that face-selective IT neurons respond not just to an image of a face, but to parts of an image where contextual cues—such as a body—indicate a face ought to be, even if what is there is not a face. Thus, we find that face cells demonstrate a large contextual surround tuned to entities that normally co-occur with faces. These results reveal surprisingly complex and heterogeneous inputs to face-selective neurons, suggesting that IT neurons do not represent objects in isolation, but are also sensitive to object relationships that reflect statistical regularities of the experienced environment.

Results

We recorded simultaneously from face-selective neurons in the fMRI-defined middle lateral (ML) and posterior lateral (PL) face patches in one rhesus macaque monkey, Monkey 1, and from ML in another monkey, Monkey 2, using chronically implanted 32-channel multi-electrode arrays (Figure 1—figure supplement 1). When mapped using ~2°x2° faces and round non-face objects, the activating regions were small (1–3°), located in the central visual field (within 2° of fixation, and largely contralateral; Figure 1—figure supplement 2). When tested with images of faces, hands, bodies (only for ML arrays), and objects presented centered on this activating region, all the visually responsive sites in both monkeys responded more to faces than to any other category (Figure 1—figure supplement 3). Category selectivity was assessed with an index: (category 1 response − category 2 response)/(category 1 response + category 2 response). In particular, in both monkeys’ ML face patches, responses to faces were much larger than to non-face images (mean channel face vs. non-face selectivity indices = 0.63 +/- 0.06 STD and 0.64 +/- 0.14 for Monkeys 1 and 2) or to bodies (mean channel face vs. body selectivity indices = 0.77 +/- 0.11 and 0.60 +/- 0.21). Responses to bodies were weak (body vs. object selectivity indices = −0.54 +/- 0.17 STD and 0.04 +/- 0.27 for ML in Monkeys 1 and 2). Such a strong preference for faces over other categories is consistent with many previous studies (Bell et al., 2011; Tsao et al., 2006). The array in Monkey 1’s PL face patch was less face selective (mean channel face vs. non-face selectivity indices = 0.24 +/- 0.07 STD), and responses to hands were weak (mean channel hand vs. object indices = 0.04 +/- 0.02 STD), also consistent with previous work (Issa and DiCarlo, 2012).
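This contrast index can be computed in a few lines; a minimal sketch in Python, with hypothetical per-channel firing rates standing in for the recorded data:

```python
import numpy as np

def selectivity_index(resp_a, resp_b):
    """Contrast index (A - B) / (A + B); ranges from -1 to 1."""
    return (resp_a - resp_b) / (resp_a + resp_b)

# Hypothetical per-channel mean firing rates (spikes/s) to faces
# vs. non-face objects; not the actual recorded values.
face_resp = np.array([42.0, 38.5, 55.1])
nonface_resp = np.array([9.0, 8.2, 13.0])
indices = selectivity_index(face_resp, nonface_resp)
mean_idx = indices.mean()
std_idx = indices.std(ddof=1)
```

An index of 1 means a response only to the first category, −1 only to the second, and 0 equal responses; the paper reports the mean and STD of this index across channels.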

To explore how response properties of face neurons represent naturalistic scenes, we mapped the spatial organization of face-selective neurons by presenting complex, familiar and unfamiliar scenes at various positions relative to each site’s activating region (Figure 1). Conventionally IT neurons are studied by presenting single-object images at a single position while the animal holds fixation (Figure 1—figure supplement 3). Here instead, while the animal held fixation, we presented large 16° x 16° complex natural images at a series of different positions, such that different parts of the image were centered over the receptive field across presentations (Figure 1), in order to resolve the discrete spatial pattern of responses for each site to different parts of the image. The recording site illustrated in Figure 1C responded most strongly to the parts of that image that contained monkey faces. We had 28 to 32 visually responsive sites in each of the three implanted arrays. Within each array the responses were all face selective (Figure 1—figure supplement 3) and showed similar receptive-field locations and spatial response patterns to complex scenes (Figure 1—figure supplements 2 and 4). Given the uniformity of responses across channels in each array, we combined all visually responsive sites in each array for all subsequent analyses to improve signal to noise and present population-level results. Importantly, our findings are not contingent on individual unit/recording site effects. Figure 1 shows the spatial pattern of responses to a series of images from the PL array in Monkey 1, and from the ML arrays in Monkeys 1 and 2. Consistent with the PSTHs of image categories presented in isolation (Figure 1—figure supplement 3), all three arrays showed responsiveness that was spatially specific to the faces in each image. 
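The mapping analysis above, averaging each site's firing rate over all presentations in which a given part of the image was centered on the receptive field, can be sketched as follows (the trial format and firing rates are hypothetical, for illustration):

```python
import numpy as np

def response_map(trials, grid_shape):
    """Mean firing rate at each grid position.

    trials: iterable of (row, col, firing_rate) tuples, one per stimulus
            presentation; (row, col) indexes which part of the image was
            centered on the receptive field on that presentation.
    Positions never tested come out as NaN.
    """
    total = np.zeros(grid_shape)
    count = np.zeros(grid_shape)
    for row, col, rate in trials:
        total[row, col] += rate
        count[row, col] += 1
    return np.where(count > 0, total / np.maximum(count, 1), np.nan)

# Hypothetical trials on a 9 x 9 grid: position (4, 4) contains a face
# and drives strong firing; position (0, 0) is background.
rng = np.random.default_rng(0)
trials = [(4, 4, 50 + rng.normal()) for _ in range(5)]
trials += [(0, 0, 5 + rng.normal()) for _ in range(5)]
rmap = response_map(trials, (9, 9))
```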
Consistent with prior physiological and anatomical work suggesting that face patches are organized into a hierarchy that varies along the posterior-anterior axis (Grimaldi et al., 2016; Moeller et al., 2008), ML recording sites were more face selective and had longer latencies than the PL sites during simultaneous recordings (Figure 1—figure supplement 5). Activity specific to the eyes is apparent in both PL and ML (Figure 2). Activity encompassing the rest of the face emerges at longer latencies. This is consistent with prior work showing a preference for the contralateral eye in area PL that emerges between 60–80 ms post stimulus onset and longer latencies for images without a contralateral eye (Issa and DiCarlo, 2012). A similar shift in response latencies was evident when replacing the eyes with a visible occluder, the background of the image, or the surrounding texture of the face (Figure 2—figure supplement 1). In our recordings, both PL and ML exhibited a shift in the latency for activity corresponding to faces without eyes, but PL activity still preceded ML (Figure 2; Figure 2—figure supplement 1), indicating that in the absence of eyes, face-specific activity still emerges across the ventral visual hierarchy in a bottom-up fashion.
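Latencies of the kind compared here are commonly estimated as the first time bin at which the PSTH exceeds a baseline-derived threshold. A minimal sketch under that assumption (the paper does not specify its latency criterion, and the PSTH below is synthetic):

```python
import numpy as np

def response_latency(psth, times, baseline_end=0.0, k=3.0):
    """First time at which the PSTH exceeds baseline mean + k*SD.

    psth:  firing rate per time bin (spikes/s).
    times: bin centers (s), with stimulus onset at 0.
    Returns the latency in seconds, or None if the threshold is
    never crossed.
    """
    base = psth[times < baseline_end]
    thresh = base.mean() + k * base.std(ddof=1)
    above = np.flatnonzero((times >= baseline_end) & (psth > thresh))
    return float(times[above[0]]) if above.size else None

# Synthetic PSTH: 5 spikes/s baseline, stepping to 40 spikes/s at ~80 ms.
times = np.arange(-0.1, 0.3, 0.01)
psth = np.where(times > 0.08, 40.0, 5.0)
lat = response_latency(psth, times)
```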

Figure 1 with 5 supplements see all
Experimental schematic and spatial selectivity of responsiveness.

(A) The monkey maintains fixation (red spot) while a large (16° x 16°) complex natural scene is presented at locations randomly sampled from a 17 × 17 (Monkey 1) or 9 × 9 (Monkey 2) grid of positions centered on the array’s receptive field (blue dotted circle). On a given trial, 3 different images are presented sequentially at 3 different positions. Across the experiment, each image is presented multiple times at each position on the grid. (B) Parts of a 16° x 16° image (black dots) that were centered within each array’s receptive fields across trials from the 9 × 9 stimulus presentation grid; position spacing, 2°. (C) Map of a face cell’s firing rate (100–250 ms after stimulus onset) in response to different parts of the image positioned over the receptive field (recording site 15 in Monkey 1 ML array). The most effective parts of this image were the monkey faces. (D) Complex natural images shown to the monkey. Population-level response maps to these images from (E) PL in Monkey 1; (F) ML in Monkey 1; and (G) ML in Monkey 2. Firing rates were averaged over 100–250 ms post stimulus onset for Monkey 1 and 150–300 ms for Monkey 2. For E, F, and G, the color scale represents the range between 0.5% and 99.5% percentiles of the response values to each image. All 3 arrays showed spatially specific activation to the faces in the images, with higher selectivity in the middle face patches. (H) PSTHs of responses to face (red) and non-face (blue) parts of these images for (top) Monkey 1 PL, (middle) Monkey 1 ML, and (bottom) Monkey 2 ML. Shading represents 95% confidence limits. Dashed red lines denote time windows with a significant difference in response magnitude between face and non-face image parts (paired t-test across 8 images; t(7) > 2.64; p<0.05, FDR-corrected).

© 2014 Patrick Bolger. All rights reserved. The image in Panel B is used with permission from Patrick Bolger@Dublin Zoo. It is not covered by the CC-BY 4.0 license and further reproduction of this panel would need permission from the copyright holder.

© 2009 Ralph Siegel. All rights reserved. The second photograph in Panel D is courtesy of Ralph Siegel. It is not covered by the CC-BY 4.0 license and further reproduction of this panel would need permission from the copyright holder.

© 2012 A Stubbs. All rights reserved. The fourth photograph in Panel D is courtesy of A Stubbs. It is not covered by the CC-BY 4.0 license and further reproduction of this panel would need permission from the copyright holder.

© 2008 Johnny Planet. All rights reserved. The fifth photograph in Panel D is courtesy of Johnny Planet. It is not covered by the CC-BY 4.0 license and further reproduction of this panel would need permission from the copyright holder.

Figure 2 with 1 supplement see all
Cells in PL and ML responded to faces with and without eyes.

Population-level response maps from Monkey 1 PL and ML for 40 ms bins between 50 and 250 ms post stimulus onset. Face-specific activity arises in PL prior to ML for images both with and without eyes, even though the latency for each depends on the presence of eyes. (top) Activity specific to eyes emerges earlier than responses to the rest of the face in both PL and ML. (middle and bottom) For both PL and ML, face activity emerges at longer latencies when the eyes are blacked out. Photo courtesy of Ralph Siegel. Response maps scaled to 0.5% and 99.5% percentiles of the response values to each image.

© 2014 Santasombra. All rights reserved. The top-left image is reproduced with permission from Santasombra. It is not covered by the CC-BY 4.0 license and further reproduction of this panel would need permission from the copyright holder.

To explore contextual associations of natural face experience, we then presented complex scenes that lacked faces, but contained cues that indicated where a face ought to be. When presented with images in which the face was occluded by some object in the image, surprisingly, the face cells in ML in both monkeys responded to the part of the image where the face ought to be, more than to non-face parts of the images (Figure 3). Though noisier than the population average, responses to occluded faces were apparent in individual channels (Figure 3—figure supplement 1). The magnitude and timing of responses to occluded faces varied across stimuli. Across all stimuli, the population-level activity to occluded faces was reliably higher than the average response to the rest of each image (t-test across 12 and 10 images for Monkeys 1 and 2, respectively; p<0.05, FDR-corrected). Interestingly, the responses to occluded faces on average were slower than the responses to intact faces by ~30–40 ms (Figure 3—figure supplements 2 and 3). The latency of responses to occluded faces was similar to that to eyeless faces (Figure 2; Figure 2—figure supplement 1). By 119 ms and 139 ms post stimulus onset for Monkey 1 and Monkey 2, respectively, the responses to occluded faces were significantly larger than the responses to the rest of the image. In contrast, responses to non-occluded versions of these faces were significantly larger than the responses to the rest of the image starting at 92 ms and 97 ms post stimulus onset for Monkey 1 and Monkey 2, respectively. It is possible that the face cells in ML responded to some tiny part of the head or neck that was still visible in each of the occluded-face photos, though similar visual features were present in other parts of each image that did not evoke strong firing. 
It is also unlikely that the occluding objects contained visual features that alone evoke strong firing, as those objects were present in other parts of each image that did not evoke strong firing (e.g., the section of the rug in between the two faces, and the boxes covering the face of the man walking down the stairs vs. the boxes below the face). Furthermore, when presented with images of people whose faces were covered by masks, baskets, or paper bags, cells in ML responded to the masked faces (Figure 4). Similar to the occluded faces, responses to masked faces were also apparent in individual channels (Figure 4—figure supplement 1). The population-level responses to masked faces varied in magnitude and timing but were consistently stronger than average responses to the rest of each image (t-test across 10 and 11 images for Monkeys 1 and 2, respectively; p<0.05, FDR-corrected). The latency of responses to masked faces (Figure 4—figure supplements 2 and 3) was also similar to that to eyeless faces (Figure 2; Figure 2—figure supplement 1). Face cells in PL also responded more to occluded faces than to the non-face parts of the images (Figure 4—figure supplement 4; p<0.05, FDR-corrected). Though cells in PL also responded on average more strongly to masked faces than to other parts of each image, this difference was not significant (Figure 4—figure supplement 4; p>0.05, FDR-corrected). Thus, responses to occluded faces were more prominent at later stages of the ventral visual hierarchy.
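The recurring statistical test in these analyses, a paired t-test across images followed by FDR correction, can be sketched as follows. The Benjamini-Hochberg procedure shown is the standard FDR method, though the paper does not name its exact implementation, and all the data below are hypothetical:

```python
import numpy as np

def paired_t(cond_a, cond_b):
    """Paired t statistic across images (df = n - 1)."""
    d = np.asarray(cond_a, dtype=float) - np.asarray(cond_b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg FDR: boolean mask of rejected tests."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Largest rank k with p_(k) <= (k/n)*q; reject everything up to k.
    below = p[order] <= (np.arange(1, n + 1) / n) * q
    reject = np.zeros(n, dtype=bool)
    if below.any():
        kmax = np.flatnonzero(below).max()
        reject[order[: kmax + 1]] = True
    return reject

# Hypothetical per-image mean responses (spikes/s) to occluded-face
# vs. non-face regions, and per-time-bin p-values to correct.
occluded = [22.0, 25.5, 19.8, 27.1]
nonface = [10.2, 12.4, 9.9, 13.0]
t_stat = paired_t(occluded, nonface)
rejected = fdr_bh([0.001, 0.008, 0.039, 0.041, 0.6])
```

Converting the t statistic to a p-value needs the t distribution CDF (e.g., `scipy.stats.ttest_rel`); the sketch stops at the statistic and the correction step.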

Figure 3 with 3 supplements see all
Cells in ML in both monkeys responded to occluded faces.

(top) Images of occluded faces. Population-level response maps from (middle) Monkey 1 ML and (bottom) Monkey 2 ML. Firing rates were averaged over 100–250 ms post stimulus onset for Monkey 1 and 150–300 ms for Monkey 2. Response patterns scaled to 0.5% and 99.5% percentiles of the response values to each image. (right) PSTHs of responses to regions of the image with occluded (orange) and non-occluded (red) faces and non-face parts (blue). Shading represents 95% confidence limits. (top) Responses of Monkey 1 ML. (bottom) Responses of Monkey 2 ML. Dashed red (and orange) lines denote time windows with significant differences in response magnitudes between face and occluded-face (and occluded vs. non-face) image parts (paired t-test across 12 images for Monkey 1: t(11) > 2.80 for occluded-face vs. non-face; t(11) > 2.70 for face vs. occluded-face; and 10 images for Monkey 2: t(9) > 2.56 for occluded-face vs. non-face; t(9) > 2.73; p<0.05, FDR-corrected). Note: the small spike just after 300 ms in Monkey 1 is an artifact of the juice reward delivered after each trial, at least 200 ms prior to the onset of the next stimulus presentation. Histograms show response differences to occluded-face regions minus non-face control regions, for each image, each channel (blue) and for the mean image across channels (orange).

Figure 4 with 4 supplements see all
Responses to masked faces.

Cells in ML in both monkeys responded to faces covered by masks, baskets, or paper bags. (top) Images of covered faces. Population-level response maps from (middle) Monkey 1 ML and (bottom) Monkey 2 ML. Firing rates were averaged over 100–250 ms post stimulus onset for Monkey 1 and 150–300 ms for Monkey 2. Response patterns scaled to 0.5% and 99.5% percentiles of the response values to each image. (right) PSTHs of responses of sites in ML in Monkeys 1 and 2 to covered faces (orange). Shading represents 95% confidence limits. (top) Responses of sites in Monkey 1 ML. (bottom) Responses of sites in Monkey 2 ML. Dashed orange lines denote time windows with significant differences in response magnitudes between covered-face vs. non-face image parts (paired t-test across 10 images for Monkey 1 and 11 images for Monkey 2; t(9) > 2.81 for Monkey 1; t(10) > 2.42 for Monkey 2; p<0.05, FDR-corrected). Histograms show response differences to masked-face regions minus non-face control regions, for each image, each channel (blue) and for the mean image across channels (orange).

To ask whether the face-cell responses to masked or occluded faces were due specifically to the context of a body just below the occluded face, we manipulated some images so that a non-face object was present in an image twice, once above a body, and once not (Figure 5, top). In both monkeys’ ML patches, responses to the non-face object positioned above a body were larger than responses to the same object when not above a body, both at the population level (Figure 5) and in individual channels (Figure 5—figure supplement 1). Similar to the occluded and masked faces, population-level responses to these face-swapped images varied in magnitude and latency, but responses to non-face objects were consistently stronger when positioned above a body than when positioned elsewhere (t-test across 8 and 15 images for Monkeys 1 and 2, respectively; p<0.05, FDR-corrected). Likewise, the latency of responses to the non-face object above a body (Figure 5—figure supplements 2 and 3) was comparable to that to eyeless faces (Figure 2; Figure 2—figure supplement 1). Though cells in PL also responded on average more strongly to objects on top of bodies, this difference was not significant (Figure 4—figure supplement 4; p>0.05, FDR-corrected). Thus, responses in ML to a non-face object were facilitated by a body below it.

Figure 5 with 3 supplements see all
Face-like responses to non-face objects.

(top) Manipulated images with non-face objects in positions where a face ought and ought not to be. Population-level response maps from (middle) Monkey 1 ML and (bottom) Monkey 2 ML. Firing rates were averaged over 100–250 ms post stimulus onset for Monkey 1 and 150–300 ms for Monkey 2. Response patterns scaled to 0.5% and 99.5% percentiles of the response values to each image. (right) PSTHs of responses to non-face objects above a body (orange) or not above a body (green). (top) PSTHs from Monkey 1 ML. (bottom) PSTHs from Monkey 2 ML. Shading indicates 95% confidence intervals. Dashed orange lines denote time windows with significant differences in response magnitudes between object-atop-body vs. object-apart-from-body (paired t-test across 8 and 15 images for Monkeys 1 and 2, respectively; t(7) > 3.04 for Monkey 1; t(14) > 2.38 for Monkey 2; p<0.05, FDR-corrected). Histograms show response differences to object atop body minus object control regions for each image, each channel (blue) and for the mean image across channels (orange).

Given our finding that bodies facilitated face-cell responses to masked faces, occluded faces, and non-face objects positioned above them, we might expect bodies to also facilitate responses to faces appropriately positioned above bodies, but this was not the case. Instead, responses to faces above bodies were indistinguishable from responses to disembodied faces (Figure 6, leftmost; Figure 6—figure supplements 1 and 2, top two rows). This is consistent with the response maps to a subset of images that showed strong responses to faces regardless of the spatial relation to bodies (columns 3–5 in Figure 5, top). In contrast, responses to face-shaped noise patches (i.e. where face pixels were replaced by random 3-color channel pixel values), face outlines, and non-face objects were facilitated by the presence of a body positioned below (Figure 6; Figure 6—figure supplements 1 and 2). Face-shaped noise patches and face-shaped outlines both elicited responses above baseline even in the absence of the body (middle two columns). This was likely because these images contain typical shape features of a face, which are sufficient to drive these neurons (Issa and DiCarlo, 2012; Tsao et al., 2006). Importantly, the responses to these images when placed above a body were significantly larger (p<0.05, FDR-corrected). Further, non-face objects activated these neurons only when placed above bodies (Figure 6 rightmost column, p<0.05, FDR-corrected; Figure 6—figure supplements 1 and 2, bottom row). Together, these data suggest a ceiling effect: bodies facilitate responses of face cells only when the presence of a face is ambiguous. These neurons also respond to the region above a body even if no foreground object is present (Figure 6—figure supplement 3). In the absence of a face, the presence of a body below the receptive field appears to be sufficient for driving these face-selective cells.

Figure 6 with 3 supplements see all
Body-below facilitation of face-cell responses to non-face objects but not to faces.

Images of face, noise, face outline, and non-face object (top row) above a body and (second row) without a body. (bottom two rows) PSTHs from ML in Monkeys 1 and 2. Responses were calculated over the same face-shaped region for all images, for 7 different image sets of different individuals. Shading indicates 95% confidence intervals. Dashed colored lines denote time windows with significant differences in response magnitudes between face region above body vs. without body (paired t-test across 7 images; t(6) > 2.79 for noise; t(6) > 3.08 for outlines; t(6) > 2.76 for non-face objects; p<0.05, FDR-corrected). Inset histograms show response differences to the regions atop body minus the corresponding regions of no-body images, for each image, each channel (blue) and for the mean image across channels (orange).

If bodies positioned directly below the activating region facilitate face-cell responses, is it because face cells have a discrete excitatory region, below the activating region, that is simply responsive to bodies? Or is the effect more complex, reflecting contextual information that a body should be accompanied by a face? To address this question, we mapped responsiveness to bodies, with and without heads, at varying orientations and locations in a 9 × 9 grid pattern (Figure 7—figure supplement 1). For bodies-with-heads, responses were maximal when the face was centered in the cells’ activating region, for each orientation (Figure 7A, top row). Headless bodies produced activation maps that were similarly maximal at positions in which a face ought to be centered on the cells’ activating region, not just when the body was positioned below the activating region (Figure 7A, bottom two rows). The magnitude of the firing rate varied across the tested visual field locations (e.g., firing rates tended to be higher in the upper half of the image). Composite maps show the preferred body orientation for each spatial position tested (Figure 7A, right side). In experiments both with and without heads, the preferred orientation varied systematically such that the preferred body orientation at each spatial location positioned the face, or where the face ought to be, within the RF center. Interestingly, these neurons were also responsive when the feet landed in the activating region, for the bodies without heads, and, to a lesser extent, for the bodies with heads. This was especially apparent for inverted bodies, suggesting that both end-of-a-body and above-a-body are cues that a face might be expected.
The surprising result that face cells are maximally responsive to where a face would be expected, largely irrespective of the body orientation, indicates that the responses to occluded faces documented in Figures 3 and 4 are not responses merely to bodies located below the cells’ face-selective regions, but instead reflect information about where a face ought to be, for a given body configuration.
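The composite maps described above, which assign each grid point the orientation that evoked the largest response and mark points at or below baseline in black, reduce to an argmax over the orientation axis; a minimal sketch with hypothetical response maps:

```python
import numpy as np

def best_orientation_map(maps, baseline=0.0):
    """Preferred orientation at each grid point.

    maps: array (n_orientations, rows, cols) of mean responses to a body
          shown at each orientation and grid position.
    Returns the argmax orientation index per point, with -1 wherever no
    orientation exceeds baseline (the points plotted black in the
    composite maps).
    """
    maps = np.asarray(maps, dtype=float)
    best = maps.argmax(axis=0)
    best[maps.max(axis=0) <= baseline] = -1
    return best

# Hypothetical responses to 2 orientations on a 2 x 2 grid: each
# orientation drives one grid point; two points never respond.
maps = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 2.0], [0.0, 0.0]]])
composite = best_orientation_map(maps)
```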

Figure 7 with 3 supplements see all
Body-orientation selectivity.

(A) Upper left panel shows an example head-only image overlaid on the population-level activation map averaged across all head only images (n = 8) from Monkey 1 ML. The upper row shows an example whole body image at eight different orientations each overlaid on the average activation maps for all whole bodies (n = 8); maximum responsiveness was focal to the face at each orientation. The lower two rows show a similar spatial selectivity for the region where a face ought to be at each orientation for headless bodies (n = 8) in Monkey 1 and 2 ML. Whole and headless body image sets were presented on a uniform white background (Figure 7—figure supplement 1). The composite maps on the far right show the best body orientation at each grid point for whole bodies and headless bodies for face patch ML in both monkeys, as indicated. The scale of the body orientations from top to bottom corresponds to the body orientations shown left to right in the response maps. Black sections denote areas with responses at or below baseline. For both whole and headless bodies, maximal activation occurred for bodies positioned such that the face (or where the face ought to be) would be centered on the cells’ activating region. (B) PSTHs when whole body and face-shaped noise atop bodies were presented within the cell’s activating region in both monkeys. For whole body images, there was only a brief difference in activity between upright and inverted configurations during the initial response. For face-shaped noise, the response was substantially larger for the upright (vs. inverted) configurations for the duration of the response. Dashed colored lines denote time windows with significant differences in response magnitudes between upright vs. inverted face regions of the image (t-test across 7 images; t(6) > 4.74 for intact faces; t(6) > 3.34 for noise; p<0.05, FDR-corrected).

If bodies convey information to face cells about where a face ought to be, are there preferred face-body configurations? When face and face-shaped noise were presented attached to bodies in upright and inverted configurations, the strongest response was when the face (or face-shaped noise) was presented at the center of the activating region (Figure 7B). For intact faces with bodies presented at the center of the activating region, there was only a small inversion effect (Figure 7B; whole bodies). While reaction time differences between upright and inverted faces are robust and reliable across studies (Leehey et al., 1978; Yin, 1969), differences in response magnitudes within IT face domains are modest or not consistently found in humans (Aguirre et al., 1999; Haxby et al., 1999; Kanwisher et al., 1998) or monkeys (Bruce, 1982; Pinsk et al., 2009; Rosenfeld and Van Hoesen, 1979; Taubert et al., 2015; Tsao et al., 2006). The lack of a strong inversion effect for intact-faces-over-bodies compared to a stronger inversion effect for non-face-objects-above-bodies could reflect a ceiling effect, or it could mean that, in the presence of a face, tuning is more flexible and more invariant to spatial transformations. In contrast to responses to intact faces, there was a large, sustained difference in response magnitude between upright and inverted face-shaped noise patches with bodies (Figure 7B; noise atop bodies), suggesting that the body facilitation is maximal when the presence of a face is ambiguous and above a body. The preference for upright vs. inverted configurations with the face-shaped noise is consistent with biases apparent in Figure 7A. Though face cells in ML were responsive to headless bodies at a variety of spatial positions, activity tended to be strongest when bodies were positioned below the activating region (Figure 7A; bottom two rows). 
This was particularly clear in Monkey 2 where the part of the image that gave the strongest response for the inverted (180-degree) body orientation was the feet, meaning that the inverted body still drove neurons most when it was positioned just below the activating region. Thus, for face-ambiguous stimulation, contextual responses are strongest to spatial configurations that reflect typical experience.

How can face-selective neurons respond to bodies positioned below their receptive fields? One possibility is that such contextual sensitivity is intrinsic to IT and pre-specified for the processing of biologically important information. Alternatively, given that face domains depend on seeing faces during development (Arcaro et al., 2017), these conjoint representations might also be learned as a consequence of frequently seeing faces and bodies together. To explore whether learning is sufficient to link face and body representations, we analyzed the representation of image categories in a computational model that has similar architectural features to the object processing pathway in primates. Specifically, we took a convolutional neural network, AlexNet (Krizhevsky et al., 2012), trained to classify natural images into 1000 object categories, and assessed its activation patterns evoked by images of isolated objects from 6 different categories: human faces, headless human bodies, animals, houses, inanimate objects, and handheld tools (Figure 8A, top). We tested a stimulus set widely used for localizing category-selective regions in IT cortex (Downing et al., 2006). We assessed the similarity of activation patterns between all image pairs in every layer of AlexNet and in pixel space. As seen in the multidimensional scaling visualization, images are distinguished along categorical distinctions even based on just pixel intensity (Figure 8A, bottom). The Euclidean distances between image pairs within each category were smaller than between categories, indicating that images were more homogeneous within than between categories and highlighting that low-level features such as shape, color, or texture are a major component of these commonly tested ‘semantic’ categories (also see Dailey and Cottrell, 1999). Distinctions between categories were present in early layers of AlexNet (Figure 8A, bottom).
The distinction between faces and non-faces persisted across all layers of AlexNet. The mean distance between face pairs (red) was smaller than the mean distance between faces and non-body, non-face categories (animals, houses, objects, and tools). Across all rectified linear unit (relu) layers of AlexNet, activation patterns were more similar for within-category comparisons of both faces and bodies (Figure 8B, solid red and green circles, Spearman rank correlations) than between faces and other categories (dual colored circles). At the pixel level and in early layers of AlexNet, the mean similarity between faces and bodies (half red/half green circles) was greater than the similarity between faces and the other categories; this similarity was also stronger than that obtained under randomization of the stimulus labels for non-face categories (this randomization distribution is marked by the grey shaded region, which denotes the 2.5% and 97.5% bounds). In the deeper layers of AlexNet (relu6 and relu7), face and body image representations become more overlapping (Figure 8B, inset). In relu7, the correlation between face and body image pairs was only slightly weaker than the correlation between pairs of body images. Comparable effects were observed when assessing similarity with Pearson correlations and with Euclidean distance on z-score-normalized data. Further, these results were specific to the trained AlexNet: untrained versions of the network did not yield strong face-body correlations, and these correlations were no different from correlations between faces and other categories (Figure 8—figure supplement 1). These results demonstrate that representations of faces and bodies can merge over learning without any explicit training about faces and bodies. This is particularly impressive given that AlexNet was not trained to categorize faces, bodies, or even humans.
Because faces and bodies co-occur in people, and people were present in the training images, faces and bodies share variance in their informativeness with respect to class labels. There is thus pressure for the network to use overlapping hidden-layer representations for faces and bodies in this supervised learning scenario. This phenomenon should hold more generally in unsupervised or self-supervised settings, as co-occurrence in the input images would be expected to lead to similar overlap. Though convolutional networks can capture some aspects of the representational structure of biological networks (Hong et al., 2016; Ponce et al., 2019), these networks as currently implemented are not well suited for probing the retinotopic specificity of object representations, because they are explicitly constrained to process inputs from different portions of an image in the same way. Future work could probe the spatial relationships between commonly experienced object categories in computational models better optimized for maintaining spatial coding.
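The within- versus between-category similarity comparison can be sketched as follows. This is a minimal illustration with random vectors standing in for AlexNet layer activations; the category structure is injected by hand, and none of the numbers reflect the study's data:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank-transformed data."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def mean_pair_similarity(acts, labels, cat1, cat2):
    """Mean Spearman correlation over all image pairs drawn from cat1 x cat2."""
    idx1 = np.flatnonzero(labels == cat1)
    idx2 = np.flatnonzero(labels == cat2)
    sims = [spearman(acts[i], acts[j])
            for i in idx1 for j in idx2
            if not (cat1 == cat2 and j <= i)]  # count within-category pairs once
    return float(np.mean(sims))

rng = np.random.default_rng(0)
labels = np.repeat(np.array(["face", "body", "house"]), 10)
# Each category gets a shared 'prototype' so within-category pairs correlate more.
proto = {c: rng.normal(size=200) for c in ["face", "body", "house"]}
acts = np.stack([proto[c] + 0.8 * rng.normal(size=200) for c in labels])

within_face = mean_pair_similarity(acts, labels, "face", "face")
face_house = mean_pair_similarity(acts, labels, "face", "house")
print(within_face > face_house)  # within-category similarity is higher
```

In the actual analysis the vectors come from each relu layer of AlexNet (and from raw pixel intensities), and the comparison is run for every category pair.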

Figure 8 with 1 supplement
Learned face and body representations in hierarchical artificial neural network (AlexNet).

Representational similarity of a stimulus set comprising 240 images across 6 categories including faces and bodies. (A, top) Two example images from each category. (A, bottom) Visualization of the similarity between faces (red), bodies (green), houses (blue), objects, and tools (pink) from multidimensional scaling of Euclidean distances based on (left column) pixel intensity, (middle column) relu1 units, and (right column) relu7 units. (B) Comparison of representational similarity within and between categories. Spearman rank correlations were calculated between image pairs based on pixel intensity and on activations within AlexNet layers for all object categories. Mean similarity within (solid filled circles) and between (dual colored circles) categories is plotted for pixel intensity and the relu layers of AlexNet. The 2.5% and 97.5% percentiles from a permutation test in which the non-face stimulus labels were shuffled before calculating correlations (grey shaded region) are plotted for the pixel and relu layer comparisons. Across all layers, both face and body images are more similar within category than between categories. (inset) The representations of faces and bodies are most similar to each other in deep layers (relu6 and relu7) of AlexNet.
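The multidimensional scaling behind such visualizations is classical (Torgerson) MDS, which can be written in a few lines; the three collinear points below are a toy sanity check, not the study's stimuli:

```python
import numpy as np

def classical_mds(D2, k=2):
    """Classical (Torgerson) MDS: squared Euclidean distances -> k-D coordinates."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ D2 @ J                 # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)        # eigendecomposition (ascending order)
    top = np.argsort(vals)[::-1][:k]      # keep the k largest eigenvalues
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

# Collinear points at 0, 1, and 3 should be recovered (up to shift/reflection).
x = np.array([[0.0], [1.0], [3.0]])
D2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
coords = classical_mds(D2, k=1)
d01 = abs(float(coords[1, 0] - coords[0, 0]))
print(round(d01, 6))  # 1.0 -- pairwise distances are preserved
```

On exact Euclidean distance data, classical MDS recovers the original configuration up to translation and reflection, which is why the recovered spacing matches the input.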

Discussion

Our finding that face-selective neurons respond to parts of an image where a face ought to be—above a body—goes well beyond previous observations on category selectivity in inferotemporal cortex. Previous neuronal recordings in macaques found that face-selective neurons respond only weakly to images of headless bodies, and body-selective neurons poorly to faces, suggesting that face and body circuits are largely independent (Freiwald and Tsao, 2010; Freiwald et al., 2009; Kiani et al., 2007; Tsao et al., 2006). Human fMRI studies have also provided evidence that face and body processing are independent (Kaiser et al., 2014; Peelen and Downing, 2005; Peelen et al., 2006; Schwarzlose et al., 2005) and that face-selective domains in the FFA do not respond to headless animals or human bodies presented at fixation (Kanwisher et al., 1999; Peelen and Downing, 2005; Schwarzlose et al., 2005). Consistent with these prior findings, the face neurons we recorded from responded strongly and selectively to faces presented over a limited part of the visual field but did not respond to bodies presented to the same region. However, these neurons also responded to masked and occluded faces that lacked intrinsic facial features, but were above bodies. Further, when non-face objects that did not by themselves evoke a response from these neurons were placed above a body, the face-selective neurons fired robustly. Consistent across all six experiments, these neurons responded to bodies positioned such that the body implied the presence of a face within the neurons’ activating regions. In prior studies, both headless bodies and faces were presented at fixation, so our results suggest that these studies might have found larger responses to bodies if the bodies had been presented lower in the visual field, below the center of maximal activation to faces. Thus, face cells demonstrate a large contextual surround tuned to entities that normally co-occur with faces.

Though prior studies have mostly provided evidence for the independence of face- and body-processing circuits, a few fMRI studies found facilitatory influences of bodies on blood-flow responses in face-selective brain regions: The face fusiform area (FFA) in humans has been reported to be responsive to animal bodies with obscured faces (Chao et al., 1998) or to blurred faces above bodies (Cox et al., 2004). Cox et al., 2004 also found body-below facilitation of BOLD signal for blurred faces but not for clear faces, and they found that the BOLD signal facilitation depended on the face and body being in a correct spatial configuration. Both these findings are consistent with our neuronal recording results. In monkeys, the most anterior face patches show stronger fMRI activations to images of whole monkeys than to the sum of the responses to faces and to bodies; however, the middle and posterior face patches (which we recorded from here) do not show facilitation, and have only sub-additive responses (Fisher and Freiwald, 2015). Thus, although there is some fMRI evidence for interactions between face and body processing, fMRI studies are limited in that they can measure responses only of large populations of neurons of potentially mixed selectivity and therefore cannot resolve the relationship of face and body neural circuits. Here we show by direct neuronal recording that face and body circuits are not independent and that face-selective neurons can express large, complex, silent, contextual surround effects. By simultaneously recording neurons in the PL and ML face patches, we show that these responses emerge rapidly and are first detected in the posterior patch, suggesting that contextual effects involve at least some feedforward inputs that cannot be attributed to cognitive processes such as mental imagery or expectation.
The idea that the context of a visual scene leads to the expectation, enhanced probability, or prior of seeing some particular item is usually thought to involve top-down processes. However, if the feedforward inputs to, say, face-selective neurons, are sculpted by experience to be responsive not only to faces, but also to things frequently experienced in conjunction with faces, then such expectations would be manifest as the kind of complex response properties reported here. Furthermore, in contrast to the traditional view of a dichotomy of spatial and object vision, our data demonstrate that object processing in IT is spatially specific (also see Op De Beeck and Vogels, 2000).

More broadly, these results highlight that neurons in inferotemporal cortex do not code objects and complex shapes in isolation. We propose that these neurons are sensitive to the statistical regularities of their cumulative experience and that the facilitation of face-selective neurons in the absence of a clear face reflects a lifetime of experience that bodies are usually accompanied by faces. Recently, we proposed that an intrinsic retinotopic organization (Arcaro and Livingstone, 2017; also see Hasson et al., 2002) guides the subsequent development of category-selectivity in inferotemporal cortex, such as face and body domains (Arcaro et al., 2017). For example, monkeys normally fixate faces, and they develop face domains within regions of inferotemporal cortex that represent central visual space (Figure 9, top). Here, we propose that these same developmental mechanisms can account for contextual learning. Our results indicate that, just as the retinotopic regularity of experiencing faces guides where in IT face domains develop, the spatial regularity between faces and bodies is also learned in a retinotopic fashion and may account for why face and body patches stereotypically emerge in adjacent parts of cortex (Figure 9, bottom) in both humans (Schwarzlose et al., 2005) and monkeys (Bell et al., 2009; Pinsk et al., 2009). Such a spatial regularity is consistent with prior observations of a lower visual field bias for the RF centers of body-selective neurons near ML (Popivanov et al., 2015) and an upper visual field bias for eye-selective neurons in PL (Issa and DiCarlo, 2012). Thus, something as complex and abstract as context may be firmly grounded in the ubiquitous mechanisms that govern the wiring-up of the brain across development (Arcaro et al., 2019).

Preferential looking at faces imposes stereotyped retinal experience of bodies.

(top left) Primates spend a disproportionate amount of time looking at faces in natural scenes. (top right) This retinal input is relayed across the visual system in a retinotopic fashion. (bottom left) Example retinal experience of a scene along the eccentricity dimension when fixating a face. Given that bodies are almost always experienced below faces, preferential face looking imposes a retinotopic regularity in the visual experience of bodies. (bottom right) This spatial regularity could explain why face and body domains are localized to adjacent regions of retinotopic IT.

© 2009 American Physiological Society. All rights reserved. Right images reproduced from Bell et al., 2009. They are not covered by the CC-BY 4.0 license and further reproduction of this figure would need permission from the copyright holder.

Materials and methods

All procedures were approved in protocol (1146) by the Harvard Medical School Institutional Animal Care and Use Committee (IACUC), following the Guide for the care and use of laboratory animals (Ed 8). This paper conforms to the ARRIVE Guidelines checklist.

Animals and behavior

Two adult male macaques (10–12 kg) implanted with chronic microelectrode arrays were used in this experiment. In monkey 1, two floating microelectrode arrays were implanted in the right hemisphere superior temporal sulcus – one within the posterior lateral (PL) face patch in posterior inferotemporal cortex (PIT) (32-channel Microprobes FMA; https://microprobes.com/products/multichannel-arrays/fma) and a second array within the middle lateral (ML) face patch in central inferotemporal cortex (CIT) (128-channel NeuroNexus Matrix array; http://neuronexus.com/products/primate-large-brain/matrix-array-primate/). In monkey 2, one 32-channel floating microelectrode array was implanted in left hemisphere ML. Monkeys were trained to perform a fixation task. Neural recordings were performed on a 64-channel Plexon Omniplex Acquisition System. For monkey 1, recordings were performed simultaneously from the 32-channel PIT array and from 32 channels of the CIT array. In both monkeys, the same channels were recorded from for each experimental session. The internal dimensions of the Microprobes array were 3.5 × 1.5 mm, with a spacing of 370–400 microns between electrodes (across the width and length of the array, respectively). The NeuroNexus Matrix array was a comb arrangement, with 200 micron spacing between the 4 shanks and 400 micron site spacing along the shanks. The arrays were oriented with the long axis AP. The task required the monkeys to keep their gaze within ± 0.75° of a 0.25° fixation spot. We used an ISCAN eye-monitoring system to track eye movements (www.iscaninc.com).

fMRI-guided array targeting

Functional MRI studies were carried out in both monkeys to localize face patches along the lower bank of the superior temporal sulcus. For scanning, all monkeys were alert, and their heads were immobilized using a foam-padded helmet with a chinstrap that delivered juice. The monkeys were scanned in a primate chair that allowed them to move their bodies and limbs freely, but their heads were restrained in a forward-looking position by the padded helmet. The monkeys were rewarded with juice during scanning. Monkeys were scanned in a 3 T TimTrio scanner with an AC88 gradient insert using 4-channel surface coils (custom made by Azma Mareyam at the Martinos Imaging Center). Each scan session consisted of 10–12 functional scans. We used a repetition time (TR) of 2 s, echo time (TE) of 13 ms, flip angle of 72°, iPAT = 2, 1 mm isotropic voxels, matrix size 96 × 96, 67 contiguous sagittal slices. To enhance contrast (Leite et al., 2002; Vanduffel et al., 2001), we injected 12 mg/kg monocrystalline iron oxide nanoparticles (Feraheme, AMAG Pharmaceuticals, Cambridge, MA) in the saphenous vein just before scanning. Responses to image categories of faces and inanimate objects were probed. Brain regions that responded more strongly to faces than objects were identified along the lower bank of the STS (p<0.0001, FDR-corrected). See Arcaro and Livingstone, 2017 for additional details.

After array implantation, Computed Tomography (CT) scans (0.5 × 0.5 × 1.25 mm) were collected in both monkeys. The base of the FMA arrays and wires were visible on both CT images. The base of the NeuroNexus array was not clearly visible on the CT image. Each monkey’s CT image was spatially aligned to its MPRAGE anatomical image. Because brain/skull contrast is opposite between CT and MPRAGE MRI images, the two volumes were aligned by manually creating a binary brain mask for both CT and MPRAGE images and rigidly aligning the brain masks. The resulting spatial transformation matrix was applied to bring the CT and MPRAGE images into alignment. The locations of the FMA arrays were then compared with the fMRI-defined PL and ML face patches.

Electrophysiology display

We used MonkeyLogic to control experimental workflow (https://monkeylogic.nimh.nih.gov). Visual stimuli were displayed on a 19-inch Dell LCD monitor with a 60 Hz refresh rate and a 4:3 aspect ratio, positioned 54 cm in front of each monkey.

Receptive field mapping

To map the location of each recording site’s receptive field, we flashed small (~2° × 2°) images of faces or round non-face objects at different locations relative to the fixation point in a 16° × 16° grid with 1–2° spacing centered on the fixation point. Each image was presented for 100 ms ON and 200 ms OFF. Responses were defined as the mean firing rate over 80–250 ms after image onset minus the mean firing rate over the first 30 ms after image onset. Responses were averaged across image repetitions. A 2-dimensional gaussian was fit to each array channel’s 9 × 9 response grid to derive the centroids and response widths (1 sigma). We define the population RF as the mean array response. To characterize the tuning of each recording site, images of isolated faces, hands, bodies, and objects on a white background were presented within the activating region of all the visually responsive sites. Each image subtended 4° and was presented for 100 ms ON and 200 ms OFF. Responses were defined as the mean firing rate over 80–250 ms after image onset minus the mean firing rate over the first 30 ms after image onset. Responses were averaged across image repetitions. Category selectivity was assessed with an index: (category1 response − category2 response)/(category1 response + category2 response) (Baker et al., 1981). For computing the index, channels with negative responses were excluded. There were no channels with negative responses in the PL array in Monkey 1. For the ML arrays in both monkeys, we excluded 1 channel each from the body-vs-objects index due to below-baseline responses to bodies.
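A simplified version of these receptive-field and selectivity computations might look like the code below. The study fit a full 2-D Gaussian to each channel's response grid; this sketch instead estimates the centroid and width from weighted moments, and all numbers are illustrative rather than recorded data:

```python
import numpy as np

def rf_centroid_and_sigma(grid, spacing_deg=2.0):
    """Estimate RF center and width (degrees) from a response grid.
    Moment-based stand-in for the 2-D Gaussian fit used in the study."""
    g = np.clip(grid - grid.min(), 0.0, None)   # baseline-subtract, keep nonnegative
    ys, xs = np.mgrid[0:g.shape[0], 0:g.shape[1]]
    total = g.sum()
    cx = (g * xs).sum() / total                 # response-weighted centroid
    cy = (g * ys).sum() / total
    var = (g * ((xs - cx) ** 2 + (ys - cy) ** 2)).sum() / total / 2.0
    return cx * spacing_deg, cy * spacing_deg, np.sqrt(var) * spacing_deg

def selectivity_index(r1, r2):
    """(category1 - category2) / (category1 + category2); channels with
    negative responses were excluded before computing this index."""
    return (r1 - r2) / (r1 + r2)

# Synthetic 9 x 9 grid: Gaussian bump (sigma = 1.5 grid steps) at position (5, 3).
ys, xs = np.mgrid[0:9, 0:9]
grid = np.exp(-((xs - 5) ** 2 + (ys - 3) ** 2) / (2 * 1.5 ** 2))
cx, cy, sigma = rf_centroid_and_sigma(grid, spacing_deg=2.0)
print(round(selectivity_index(20.0, 5.0), 2))  # 0.6
```

With 2° grid spacing, the recovered centroid lands near (10°, 6°) and the width near 3°, matching the synthetic bump.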

Main Experiment stimuli

Stimuli comprised 16° × 16° natural scene photographs embedded in a brown-noise image that filled the entire monitor screen. Each photo could contain humans, monkeys, other animals, or artificial objects (Figure 7—figure supplement 3). Several of the photos were modified to remove the face or to replace the face with non-face objects or pixelated noise. Photos were grouped into several image sets. The first image set comprised unaltered natural scenes containing humans and monkeys. The second image set comprised images of people wearing masks. The third image set contained images where faces were occluded with objects. The fourth image set comprised images where objects in each scene were pasted over the faces. The fifth image set comprised 6 people and one monkey with 8 variations for a total of 56 images. For each person/monkey, the face was replaced with either pixelated noise (i.e. where face pixels were replaced by random 3-color-channel pixel values), an outline, another object in the scene, or left intact. Each face manipulation was presented either above a body or without the body. The sixth image set comprised 7 people and 1 monkey, each at one of 8 orientations at 45° intervals, for a total of 64 images. The seventh image set comprised 6 people and 1 monkey with 4 variations for a total of 28 images. For each person/monkey, the face was presented either intact or replaced with pixelated noise, and either in upright or inverted orientation. For each experiment, pictures were presented over a 17 × 17 point grid (spacing 1°) for Monkey 1 and a 9 × 9 point grid (spacing of 2°) for Monkey 2, with the image centered on the center of the population receptive field. Generally, Monkey 2 did not work for as long as Monkey 1, so a coarser grid was necessary to obtain enough repetitions of all stimuli at each position. The difference in grid sampling is unlikely to be a major factor, as results for all experiments were consistent between monkeys. Further, the 2° spacing of this 9 × 9 grid remained smaller than the response width of the RF mapping and was sufficient for detecting spatially specific activity. Images were presented for 100 ms ON and 200 ms OFF. Three images were presented per trial (Figure 1A). At the start of each trial, the fixation target appeared, and the animal had up to 5 s to direct its gaze to the fixation target. If the animal held fixation until the end of the trial, a juice reward was delivered 100 ms after the offset of the last (third) image. Every picture was presented at each grid position 5–10 times. In Monkey 1, PL and ML arrays were recorded from simultaneously.

Experimental session workflow

Data were collected across 4 months in M1 and 1 month in M2. Each day, the animal would be head-fixed and its implant(s) connected to the experimental rig. First, the animal’s gaze was calibrated with a 9-point grid using a built-in MonkeyLogic routine. We used the Plexon Multichannel Acquisition Processor Data Acquisition System to collect electrophysiological information, including high-frequency (‘spike’) events, local field potentials, and other experimental variables, such as reward rate and photodiode outputs tracking monitor frame display timing. Each channel was auto-configured daily for the optimal gain and threshold; we collected all electrical events that crossed a threshold of 2.5 SDs from the mean peak height of the distribution of electrical signal amplitudes per channel. These signals included typical single- and multi-unit waveforms.
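The event-detection step can be illustrated with a toy trace. Note that the actual acquisition system thresholded at 2.5 SDs of the distribution of waveform peak heights; this sketch simply thresholds the raw signal, and the trace and spike positions are made up:

```python
import numpy as np

def detect_events(signal, n_sd=2.5):
    """Return the first sample index of each excursion above mean + n_sd * SD.
    (Simplified: real systems threshold peak heights and match waveforms.)"""
    thresh = signal.mean() + n_sd * signal.std()
    above = signal > thresh
    # A crossing is a supra-threshold sample whose predecessor was below.
    return np.flatnonzero(above & ~np.roll(above, 1))

rng = np.random.default_rng(1)
trace = rng.normal(0.0, 1.0, size=5000)   # background noise
trace[[1000, 2500, 4000]] += 10.0         # three embedded "spikes"
events = detect_events(trace)
# All three embedded spikes should be among the detected crossings.
print(all(np.abs(events - s).min() <= 3 for s in (1000, 2500, 4000)))
```

Because a 2.5 SD criterion also admits occasional noise excursions, the detected events here include both single- and multi-unit-like threshold crossings, much as described for the recordings.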

Spike data preparation

The raw data comprised event (‘spike’) times per channel for the entire experimental session. In Monkey 1, 32 channels were recorded simultaneously from both the posterior and middle face patch arrays for the main experiment. The same channels were recorded from for the entire duration of this study.

Analysis

For each image, trials were sorted based on grid position. Trial repetitions were averaged to derive the mean activity when each grid position was centered within the mean activating region of each array. The mean activity was visualized over each image with pixels between grid points linearly interpolated. Activity overlapping pixels within intact or manipulated faces was averaged and plotted as post-stimulus time histograms (smoothed using a gaussian kernel over a 20 ms sliding window). This activity was compared with the mean activity over pixels of other parts of the image. Significance was assessed across images with paired, two-tailed t-tests, corrected for a false discovery rate at an adjusted p value of 0.05.
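The false-discovery-rate correction referred to here is the standard Benjamini-Hochberg procedure, which can be implemented compactly (the p values below are arbitrary illustrations, not values from the recordings):

```python
import numpy as np

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg FDR: boolean mask of tests surviving correction."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Find the largest k such that p_(k) <= (k/m) * q; reject hypotheses 1..k.
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.flatnonzero(below).max()
        reject[order[:k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.2, 0.6]
print(fdr_bh(pvals))  # [ True  True False False False False]
```

Only the two smallest p values fall below their rank-scaled thresholds (0.0083 and 0.0167), so only those two tests survive correction in this example.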

Convolutional neural network image analysis

240 greyscale images evenly distributed across 6 categories (human faces, headless human bodies, animals, houses, tools, and inanimate objects) were analyzed. Each category image was presented on a uniform white background. Images were analyzed using the MATLAB implementation of a pre-trained convolutional neural network, AlexNet. Network activations across relu layers of AlexNet were computed for each image. To measure the representational similarity across images within a layer, Spearman rank correlations were computed between z-scored activations for each image pair. To visualize the representational structure of each layer, multidimensional scaling was performed on squared Euclidean distances between the z-scored activation patterns of image pairs. To probe distinctions based purely on low-level image properties, correlations and multidimensional scaling were also performed on the vectorized 3-color-channel pixel intensities for each image. Permutation testing was performed by randomizing the labels of non-face images prior to computing correlations between face and non-face categories. Mean correlations and the 2.5% and 97.5% percentiles across 2500 randomizations were computed for each layer of AlexNet and for pixel intensity. To examine the effect of training, an additional permutation test was performed in which the weights within each layer were shuffled prior to measuring the activations for each image and computing image pair correlations. This approach simulates an untrained network while preserving the distribution of weights found in the trained network. The mean correlations and the 2.5% and 97.5% percentiles across 500 permutations were computed for each layer of AlexNet.
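The label-shuffling permutation test can be sketched with random vectors standing in for image activations. The face-body "co-occurrence" structure is injected by hand here, purely to show the mechanics; nothing below reflects the actual stimuli or network:

```python
import numpy as np

def face_cat_sim(acts, labels, cat):
    """Mean Pearson correlation between all face images and images labeled cat."""
    a = acts[labels == "face"]
    b = acts[labels == cat]
    return float(np.mean([np.corrcoef(x, y)[0, 1] for x in a for y in b]))

rng = np.random.default_rng(2)
labels = np.repeat(np.array(["face", "body", "house", "tool"]), 8)
proto = {c: rng.normal(size=200) for c in ["face", "body", "house", "tool"]}
proto["body"] = 0.7 * proto["face"] + 0.7 * rng.normal(size=200)  # shared structure
acts = np.stack([proto[c] + 0.7 * rng.normal(size=200) for c in labels])

observed = face_cat_sim(acts, labels, "body")
# Null distribution: shuffle the non-face labels before recomputing the statistic.
nonface = labels != "face"
null = []
for _ in range(200):
    shuffled = labels.copy()
    shuffled[nonface] = rng.permutation(labels[nonface])
    null.append(face_cat_sim(acts, shuffled, "body"))
lo, hi = np.percentile(null, [2.5, 97.5])
print(observed > hi)  # face-body similarity exceeds the shuffled-label bound
```

Because shuffling mixes house and tool images into the "body" label, the null distribution dilutes the shared face-body structure, and the observed face-body correlation falls above its 97.5% bound, mirroring the grey-shaded comparison in Figure 8B.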

Quantification and statistical analyses

For activity maps in Figures 1–7, Figure 1—figure supplements 2, 4 and 5, Figure 2—figure supplement 1, Figure 3—figure supplements 1–3, Figure 4—figure supplements 1–3, Figure 5—figure supplements 1–3, Figure 6—figure supplements 1–3, and Figure 7—figure supplement 2, spike data were averaged across trial repetitions. For PSTH graphs in Figures 1–7 and Figure 4—figure supplement 4, the mean response and 95% confidence intervals across images were calculated. For each 1 ms bin, two-tailed, paired t-tests were performed across images on the firing rates between the target and control regions. The resulting p values were adjusted for false discovery rates. Given the relatively small number of observations going into each paired t-test, significant effects were verified further with nonparametric (Wilcoxon signed-rank) tests.
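Because only a handful of images enter each test, the parametric results were double-checked nonparametrically. The sketch below uses a sign-flip permutation test as a simple stand-in for the Wilcoxon signed-rank test; the per-image firing rates are synthetic:

```python
import numpy as np

def sign_flip_test(diffs, n_perm=5000, seed=0):
    """Paired two-sided permutation test on the mean difference, built by
    randomly flipping the sign of each paired difference (a stand-in here
    for the Wilcoxon signed-rank test)."""
    rng = np.random.default_rng(seed)
    observed = diffs.mean()
    flips = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (flips * diffs).mean(axis=1)          # null: no systematic difference
    return float(np.mean(np.abs(null) >= abs(observed)))

rng = np.random.default_rng(3)
control = rng.normal(10.0, 2.0, size=20)          # control-region rates, 20 images
target = control + rng.normal(2.0, 1.0, size=20)  # target region ~2 Hz higher
p = sign_flip_test(target - control)
print(p < 0.05)  # the target-vs-control difference is significant
```

Like the signed-rank test, the sign-flip test makes no normality assumption, so agreement between it and the paired t-test guards against the small-sample concern raised above.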

References

Bar M. 2004. Visual objects in context. Nature Reviews Neuroscience 5:617–629. https://doi.org/10.1038/nrn1476
Krizhevsky A, Sutskever I, Hinton GE. 2012. ImageNet classification with deep convolutional neural networks. 25th International Conference on Neural Information Processing Systems.
Yin RK. 1969. Looking at upside-down faces. Journal of Experimental Psychology 81:141–145. https://doi.org/10.1037/h0027474

Decision letter

  1. Tatiana Pasternak
    Reviewing Editor; University of Rochester, United States
  2. Timothy E Behrens
    Senior Editor; University of Oxford, United Kingdom

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

The finding that the representation of faces in inferotemporal cortex depends on the location of the animal's fixation and is context specific was judged as novel and exciting. The elegant and unique design of the study allowed the authors to show that face-selective neurons respond not just to faces in isolation but also to parts of an image that do not contain a face but should do so, based on the context (proximity to the body). The work provided convincing evidence that the representation of complex objects in the brain reflects statistical regularities of the experienced environment.

Decision letter after peer review:

Thank you for submitting your article "The neurons that mistook a hat for a face" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Timothy Behrens as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

The authors report a series of recording experiments in macaque face patches (PL and ML) that examined the response of face responsive neuronal sites to natural images that contain bodies but with the faces occluded or obscured. Although the faces were absent or only partially visible, the face patch ML neurons tended to respond at the image location where the face was supposed to be. They conclude that face neurons of the face patch ML are sensitive to contextual influences and experienced environmental regularities. This paper reports novel data on the response properties of face neurons in face patch ML: responses to the location of a face when the latter is absent or occluded by another object but suggested by the context of a body. The experimental paradigm is also quite interesting and unique in this field: presenting a complex natural image at different locations and then analyzing the response as a function of the location or hotspot of the receptive field of the neuron as measured with a single face stimulus.

Essential revisions:

The manuscript was well received by both reviewers, who found the work exciting and the results novel. However, they raised a number of issues, mainly concerned with the need for a more detailed description of the data to give the readers a more rigorous impression of the observed effects. To achieve this goal, the paper needs better quantification of the prevalence of the effect and its presence in individual neurons in the recorded population. To help the authors, the reviewers provided a detailed list of questions, requests and suggestions, which must be addressed before the manuscript will be considered for publication in eLife. These are summarized below.

Reviewer 1:

1) Please clarify the data in Figures 2 and 3 by providing cell numbers for each plot and for each recorded region. In addition, provide information about stimulus selectivity of the recorded neurons and discuss the relationship between this selectivity and responses to faces shown in these figures.

2) Were the RFs showing responses to images in Figure 1—figure supplement 1, also mapped with small dots?

3) Please clarify the composite body orientation maps shown in Figure 6.

Reviewer 2:

1) Please discuss some of the discrepancies between the data and their description in the text outlined in detail by this reviewer. Also, please address the question whether scaling maps to the minimum and maximum response for each image prevents detecting responses at other locations. Also, in Figures 5 and 6B please show maps for the images, in addition to the PSTHs.

2) Differences in activity between different recording sites need to be discussed. A related question is the definition of non-face regions. This reviewer makes specific suggestions concerning Figure S6, which you may find helpful in dealing with the problem.

3) Please discuss responses to the legs for inverted bodies shown in Figure 6 and whether these multi-unit responses reflect mixed selectivity with the recorded patch. Showing responses of well-isolated units would help addressing this possibility.

4) Long latencies of responses to occluded head images cast doubt on the interpretation proposed in the Discussion that rules out top-down explanation of the effect. In addition, the proposed role of experience is hypothetical and should be presented as such.

5) This reviewer also made several points that should be addressed in the body of the paper or in the Discussion. These include, clarifications of mapping RFs with dots and the possibility that their size was underestimated, dealing with negative response values when computing the index, enlarging the size of Figures 4 and S6 for better visibility and correction of the legend for Figure 4.

6) Please comment on the validity of using parametric t-tests to evaluate the effects and whether the assumptions underlying these tests are fulfilled.

Reviewer #1:

This study demonstrates that neurons in two macaque face patches, ML and PL, respond in a retinotopically specific way to face-related features of a scene. Surprisingly, these face-related responses are often elicited even when the basic structural features of a face are occluded, such that only the bodily cues indirectly indicate the presence of a face in the neuron's response field.

1) This is a very exciting study that makes two important points. The first, which is not featured, perhaps because it is "sort of" known, is that for neurons in these face patches the animal's fixation position is extremely important. Responses to faces or "unseen" faces are wholly contingent on the face being present on a limited portion of the retina. I thought this aspect of the study is very important for a community that does not typically think about "face cells" in this way. The second aspect, which is featured more strongly, is that neurons respond in this retinotopically specific way to occluded faces. This is quite a remarkable finding. The authors do some interesting extensions and controls, including showing a surprising lack of influence of adding the body to the visible face. This study provides much food for thought.

2) My main criticism of the paper is that I don't have a clear picture of how representative the featured effect is. Figures 2 and 3 show heat maps, and refer to "maps of cells". Are these population maps (as in the bottom panels of Figure 1—figure supplement 3), or are they selected responses of individual cells? There is a statement early on that everything is population analysis, but the authors need to do a better job explaining how many cells go into each of the plots, since the reader needs to understand to what extent this behavior is typical of neurons in ML and PL. Having only one PL recording site also significantly weakens this aspect.

3) Relatedly, since neurons differ in their specific preferences, it's surprising that there is minimal discussion of this. The authors should do more to provide information about how stimulus selectivity (aside from positional selectivity) interacts with their main findings.

Reviewer #2:

The authors report a series of recording experiments in macaque face patches (PL and ML) that examined the response of face responsive neuronal sites to natural images that contain bodies but with the faces occluded or obscured. Although the faces were absent or only partially visible, the face patch ML neurons tended to respond at the image location where the face was supposed to be. They conclude that face neurons of the face patch ML are sensitive to contextual influences and experienced environmental regularities.

This paper reports novel data on the response properties of face neurons in face patch ML: responses to the location of a face when the latter is absent or occluded by another object but suggested by the context of a body. The experimental paradigm is also quite interesting and unique in this field: presenting a complex natural image at different locations and then analyzing the response as a function of the location or hotspot of the receptive field of the neuron as measured with a single face stimulus. The main result agrees with previous fMRI studies in humans and monkeys that showed responses in face areas/patches to blurred faces on top of a body, but the present study examines this in more detail at the spiking-activity level in the posterior face patches.

In general, I found this a highly interesting and exciting paper but I have some concerns and comments that need to be addressed, in particular with respect to the presentation and interpretation of these remarkable data.

1) The results are not as clear-cut as the impression one gets from reading the text of the paper. Examples of such issues are the following. Figure 2: Monkey 1 ML: second column image (human carrying a box): in the occluded condition there is also strong activation in the top right corner where the box is. In Figure 3, third and fifth columns: the location of the masked face and the hotspot of neural activity do not match well in the data of M2. Furthermore, for the 7th column image of Figure 3 (3 persons carrying boxes) there is a clear hotspot in both monkeys only for the rightmost (third) person: why not for the other two persons, who are closer? In the 9th column image, there is an additional hotspot in the data of M2 on the body below the covered head. Also striking is the variability in effects between images and monkeys. This is apparent in the data of Figure 4: second column (little response for the ball covering the head in M1), fourth column (no or little response to a non-face object on a face), fifth column (no response to the covered face in M1). Also, in general the data of M2 appear more "smeared out" than those of M1: is that related to RF size differences, neuron-to-neuron variability, or the mapping with larger steps in M2? The authors should discuss these discrepancies between the data and what they describe in the text in detail, acknowledging that these cells do not always respond at, or sometimes even close to, the location of the occluded head. One related problem concerning the data presentation is that the maps are scaled to the minimum and maximum response for each image. I wonder how the maps would look when shown across the full possible range of net responses for a site (i.e. from 0 net response to maximum response): was there no response at locations other than (or close to) the (occluded) head/face?
Also, related to this point, the authors should show the maps for the images of Figures 5 and 6B – not just the PSTHs (see next main point). In other words, these interesting data should be presented in greater detail.

2) The authors show only averages across electrodes per animal (except for the responses to unoccluded faces (Figure S3)), and one wonders how strong the differences are between the different neuron sites. To remedy this, the authors can show the maps for each image in Figure S6 (but please make the images larger) so that the reader can assess the consistency (or lack thereof) of the responses for the different images. The PSTHs in the current figures are averages across the images, so the presented confidence intervals give some indication of the variability for the different images. However, it was unclear to me how the authors defined the non-face region. They should indicate the latter in Figure S6 and describe in detail how they define the non-face region: centered on an object, equated for contrast/luminance with the face/occluded face, same area, etc.? Choosing a valid non-face region does not seem trivial to me for these complex images, and such a choice may affect the responses shown in the PSTHs. Finally, the authors can compute for each site and image an index contrasting the net responses to the face and the control, non-face region, and show the distribution of such indices for each experiment.
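The face-versus-non-face index proposed here could be sketched as follows. This is a minimal illustration with hypothetical net-response values, not the authors' actual analysis pipeline:

```python
import numpy as np

def selectivity_index(face_resp, nonface_resp):
    """Contrast index on net (baseline-subtracted) responses.

    Ranges from -1 (non-face preferred) to +1 (face preferred),
    assuming both net responses are non-negative.
    """
    return (face_resp - nonface_resp) / (face_resp + nonface_resp)

# Hypothetical net responses (spikes/s above baseline) for one site
# across four images; real values would come from the recordings.
face = np.array([40.0, 35.0, 50.0, 28.0])
nonface = np.array([5.0, 8.0, 12.0, 6.0])

# One index per image; a histogram of such indices across sites and
# images is the distribution the review asks to be shown.
indices = selectivity_index(face, nonface)
```

The index is bounded and unitless, so it can be pooled across sites and experiments without further normalization.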

3) In the experiment of Figure 5, the authors compare the responses to a modified face alone and the modified face on top of a body and find stronger responses when the body is present. However, they should also have included the body without a face as a control condition; otherwise the increased response could have been due to a (weak) response to the body itself (summation of the obscured-face and body responses). The absence of such a summation effect for full faces could be due to a ceiling effect, as suggested by the authors.

4) The strong responses to the legs for the inverted body in Figure 6 are surprising to me. They clearly show that the responses of these neurons to body stimuli without a head are more complex than just a response to an occluded face or absent head. The authors suggest that this is related to responses to short contours in the face cells. But if that is the case, can it not also explain the responses to the rectangular shapes (which also have short contours) occluding the head in the images of Figure 3? Also, if it is related to short contours, why are there not comparably strong responses to the images with a head (upper row of Figure 6)? It is clear to me that we do not understand the responses of these neurons. Also, how consistent was this leg response across the different sites? It also makes one wonder whether some of these "weird" responses arise because the recordings are multi-unit, which will pool the responses of different neurons, some of which may not be face selective. In other words, it could be that the single neurons that respond to the legs are not face selective at all but are mixed with face-selective neurons in this patch. It would have been informative to show the responses of well-isolated single units in this case (and also in the other experiments).

5) The authors suggest in the Discussion that the responses to the occluded head are not due merely to expectation. However, the latencies of the responses to the occluded head appear to be longer than to a real face (and, in some cases, the responses were, as for the non-face region, preceded by inhibition; Figures 2, 3, 4 and 5 in M1). Thus I feel one cannot rule out an expectation or top-down explanation of the effects.

6) That the responses to the occluded heads result from experience makes sense but is speculation and not demonstrated in the present experiments. This should be made explicit.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your article "The neurons that mistook a hat for a face" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Timothy Behrens as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: David Leopold (Reviewer #1).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, we are asking editors to accept without delay manuscripts, like yours, that they judge can stand as eLife papers without additional data, even if they feel that they would make the manuscript stronger. Thus the revisions requested below only address clarity and presentation.

Summary:

The revised manuscript addressed many of the issues raised in the initial review. However, one of the reviewers raised additional concerns and made several suggestions that must be addressed for the manuscript to be accepted for publication. These are listed below.

Essential revisions:

1) Please provide the distribution of indices to allow the reader to better assess the variability of the data.

2) Please create an additional figure that directly compares PSTHs recorded in the ML and PL face patches (see reviewer 2 comment #3).

3) The claim that the changes in face-body representations at successive layers of the Alexnet simulations result from learning requires additional analysis of category representations in an untrained Alexnet using random weights.

4) Please tone down the interpretation of the importance of the eyes when referring to Figure 2 in the second paragraph of the Results section.

Reviewer #1:

The authors have done an outstanding job addressing my comments, clarifying a few bits and then showing many single-cell examples of this striking phenomenon in the supplementary material. There are many important aspects of this paper beyond the headline, including the spatial specificity of face selective neurons, as well as the similar but delayed responses in ML compared to PL. I'm happy to recommend publication.

Reviewer #2:

The authors performed an extensive revision, with new interesting data and modelling. They responded to all my concerns and overall I am happy with their revision. Nonetheless, I still have a couple of concerns that should be addressed by the authors.

1) I understand that it is not possible to show the responses of all units to all stimuli, given the size of the data. I asked for distributions of indices to be shown in order to get an idea of the across-image and across-unit variability, but the authors argue that this is not practical because of the variability in response latency amongst images and units. Such an argument can always be made, and taken to its conclusion one should then refrain from using indices altogether (yet the authors do compute face and body indices). I think that when one uses net responses to compute indices, the variability in latency should not be a main cause of variance among indices of different neurons or images. One can employ an analysis window that captures the population response, computed across images and neurons, per experiment. The confidence intervals that are provided now depend on the number of observations; although they show the consistency of the mean population response, the CIs are not a straightforward metric of the variability among sites or images. Also, it is not clear whether the individual site examples shown in the figure supplements are representative of the observed diversity of response profiles. Hence, I still believe that distributions of indices will help to appreciate the variability among images and sites.

2) The authors provide novel data on the responses to images in which the eye is invisible. However, in the images shown in Figure 2, either more than the eyes is removed (top image), and it is unclear how these other changes affected the response to the disfigured face, or the eyes are occluded by a visible occluder (bottom image). The latter is known to increase the response latency of IT neurons. To assess whether removing the eye from the image affects the response, it would have been better to present a face without eyes, with the eye region filled in with the neighboring texture, so that no additional changes (holes or a bar occluder) are present. With the current images, I believe the authors should be more careful in interpreting the response and latency differences between the full face and the images in which the eyes were removed or occluded.

3) The authors suggest that the responses to the occluded faces do not result from top-down processing, using as argument that the response latency differences between ML and PL remain for the occluded faces, which is a reasonable argument. However, PL and ML responses are compared directly only for the intact faces and the "invisible eye" images, but not for the face occlusion conditions. One way to show this is to compare in one figure the population PSTHs between ML and PL for the different conditions.

4) They write, when describing the novel and interesting Alexnet simulations, that "The Euclidean distances between image pairs within each category were smaller than between categories, indicating that images were more homogeneous within than between categories and highlighting that simple shape is a major component of these commonly tested "semantic" categories". However, the images also differ in texture and it is possible that not only shape but also texture is driving the category differences in Alexnet. In fact, some recent studies stressed the importance of texture cues for categorization in deep convolutional neural networks.

5) The authors employ the Alexnet simulation to demonstrate that representations of faces and bodies can merge over learning. However, to show that learning is critical they should perform a similar analysis of the category representations in an untrained Alexnet, e.g. with random weights.

6) The authors write "that convolutional networks can capture the representational structure of biological networks." It all depends on what is meant by "capture" here, but there is still a discrepancy between IT and classical feedforward deep net (like Alexnet) representations, e.g. see recent work by DiCarlo's group. I suggest toning down this statement on the correspondence between deep nets and ventral stream representations.

7) Can the authors explain why they permute the labels of the non-face images when computing the between face, non-face correlations (inset; Figure 8B)? Permutation of the non-face labels in this case should not make a difference – or perhaps I am missing something here.

8) The authors should provide the explained variance of the 2D MDS configuration shown in their novel Figure 8A, so that the reader can appreciate how the distances plotted in 2D correspond to the original distances in the high-dimensional feature space.

9) The authors write in their Discussion "More broadly, these results highlight.… that the facilitation of face-selective neurons in the absence of a clear face reflects a lifetime of experience that bodies are usually accompanied by faces.". As I noted before, there is no evidence yet that the contextual effects reported by the authors do result from experience during the lifetime. Please rephrase.

10) Figure 1—figure supplement 1: the authors should indicate the t-values (with the corrected p = 0.05 threshold) of the face-object contrast in the fMRI maps.

11) Figure 1—figure supplement 3: can the authors explain the above-baseline responses in M1, ML before 50 ms after stimulus onset? One would expect responses to equal baseline soon after stimulus onset, given the response latency of the neurons.

https://doi.org/10.7554/eLife.53798.sa1

Author response

Essential revisions:

The manuscript was well received by both reviewers, who found the work exciting and the results novel. However, they raised a number of issues, mainly concerned with the need for a more detailed description of the data to give the readers a more rigorous impression of the observed effects. To achieve this goal, the paper needs better quantification of the prevalence of the effect and its presence in individual neurons in the recorded population. To help the authors, the reviewers provided a detailed list of questions, requests and suggestions, which must be addressed before the manuscript can be considered for publication in eLife. These are summarized below.

We thank the reviewers for their thorough reviews and positive assessment. We provide a point-by-point response to each of the reviewers’ comments. In summary, we made the following revisions to our manuscript:

1) We have made extensive edits to the text to address reviewer comments.

2) We provide additional quantifications of response consistencies and statistical analyses as requested.

3) We have added 12 additional supplementary figures (Figure 3—figure supplements 1-3, Figure 4—figure supplements 1-3, Figure 5—figure supplements 1-3, Figure 6—figure supplements 1-2, and Figure 7—figure supplement 2) that show more individual-channel and population-level response maps for images reported in the main text figures.

4) Using data we previously collected, but had not reported, we have added 1 new main text figure (Figure 2) and 3 additional supplementary figures (Figure 1—figure supplement 6, Figure 2—figure supplement 1, and Figure 6—figure supplement 3) that show response maps to additional images that illustrate the selectivity and latency of responses.

5) We have added a new analysis of category representations using an artificial neural network (Figure 8).

6) We have added a new supplementary figure illustrating the localization of the arrays to the face patches in IT.

Reviewer 1:

1) Please clarify the data in Figures 2 and 3 by providing cell numbers for each plot and for each recorded region. In addition, provide information about stimulus selectivity of the recorded neurons and discuss the relationship between this selectivity and responses to faces shown in these figures.

Data in Figures 3 and 4 (previously Figures 2 and 3) are population activity (averaged across all responsive channels within an array). We have clarified this in the figure legends as well as in Figures 5 and 7. Previously, we showed individual channel data from Monkey 1 and Monkey 2 only for Figure 1 in Figure 1—figure supplement 4. Now, we include supplementary figures that show 3 individual channels each for Monkey 1 and Monkey 2 for the maps in Figure 3 (Figure 3—figure supplement 3), Figure 4 (Figure 4—figure supplement 3), and Figure 5 (Figure 5—figure supplement 3). To better convey the consistency of responses across channels, we now also report the average Pearson correlation between each channel and the n-1 population average (i.e. leaving out that channel) for data in Figures 3 – 5.
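For readers unfamiliar with the n-1 (leave-one-out) population-average correlation described here, a minimal sketch, with synthetic data standing in for the recorded response maps:

```python
import numpy as np

def loo_correlations(maps):
    """Leave-one-out consistency across channels.

    maps: (n_channels, n_positions) array of response maps.
    Returns, for each channel, the Pearson correlation between that
    channel's map and the average map of the remaining n-1 channels.
    """
    n = maps.shape[0]
    total = maps.sum(axis=0)
    corrs = []
    for i in range(n):
        loo_mean = (total - maps[i]) / (n - 1)   # n-1 population average
        corrs.append(np.corrcoef(maps[i], loo_mean)[0, 1])
    return np.array(corrs)

# Toy example: 8 channels sharing a common spatial profile plus noise.
rng = np.random.default_rng(0)
profile = np.sin(np.linspace(0, np.pi, 50))
maps = profile + 0.1 * rng.standard_normal((8, 50))
r = loo_correlations(maps)   # high values -> channels are consistent
```

Averaging these per-channel correlations gives a single number summarizing how representative the population map is of its individual channels.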

2) Were the RFs showing responses to images in Figure 1—figure supplement 1, also mapped with small dots?

We apologize for this confusion. That was a mistake in the text. The RFs in Figure 1—figure supplement 2 (formerly Figure 1—figure supplement 1) were mapped with 2 degree faces and round non-face images. We now show examples of the faces and round non-face images in Figure 1—figure supplement 2 and report the mean and standard deviations of RF sizes across channels for each array in the legend. In a separate mapping session (not reported in manuscript) we did perform RF mapping in Monkey M1 using face features such as eyes and simplified versions of eyes (small dots and small concentric circles). Similar to the mapping with 2º faces and objects, the centers of those RFs were within the central visual field, but the RFs were even smaller (mean RF 1.41 +/- 0.24 STD for Monkey 1 ML). The RF centers were in similar locations using these stimuli. We did not perform this mapping with face features in monkey M2. For the purposes of our complex natural image recordings, only the center of the RF is critical for the analysis. If anything, larger RFs would reduce the resolution of focal activity and minimize the effects observed in our main experiments. The fact that activity was focal to faces in natural images demonstrates that the RFs were not so large as to be a major constraint on our results.

3) Please clarify the composite body orientation maps shown in Figure 6.

The composite orientation maps serve as summary data illustrating the best body orientation at each grid point. We feel this is important to include given the variability in response magnitudes across both body orientations and spatial location. This is especially true for upright vs. inverted headless bodies as shown more clearly in Figure 7B. Because of this, one cannot look at any individual orientation to know what the preferred tuning is for that point in space. As requested by the reviewers, we have made the color scale and corresponding body orientation images on the right side of Figure 7A (formerly Figure 6) larger to improve readability. We also changed the colormap for the composite orientation images to make them more distinct from the response maps so readers would be less likely to confuse those data for firing rates. We have included the following explanation of these composite maps in the Results:

“The magnitude of firing rate varied across the tested visual field location (e.g., firing rates tended to be higher above the upper half of the image). Composite maps show the preferred body orientation for each spatial position tested (Figure 7A, right side). In experiments both with and without heads, preferred orientation systematically varied such that the preferred body orientation at that spatial location positioned the face, or where the face ought to be, within the RF center.”

Reviewer 2:

1) Please discuss some of the discrepancies between the data and their description in the text outlined in detail by this reviewer.

Reviewer #2 made several comments noting the variability of the activity maps and appearance of responses to natural scenes that are not clearly driven by the body. We address this above in main comment #1 as well as below with reviewer #2’s specific comments.

Also, please address the question whether scaling maps to the minimum and maximum response for each image prevents detecting responses at other locations.

In our initial submission we chose a min-to-max response scaling to illustrate the full range of responses. However, large outliers can bias the appearance of response maps scaled between minimum and maximum values. To minimize the influence of outliers and to better center the color-scale range on the values of each map, we now show every response map scaled between the 0.5th and 99.5th percentiles of that map. This has the effect of increasing the effective range for the bulk of each response map. We do not think this affects the interpretation of the response maps. We also tried 1st-to-99th and 2.5th-to-97.5th percentile ranges, but found that 0.5th-to-99.5th best illustrated the range of values within the activity maps. To clarify, the minimum response for Monkey 1 is often slightly below baseline, as can be seen in the PSTH responses to non-face regions. For Monkey 2, the minimum response was close to 0, with some images slightly below and others slightly above.
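The percentile-based color scaling described above can be sketched in a few lines. The toy map below is illustrative only; the percentile bounds match the stated 0.5-99.5% choice:

```python
import numpy as np

def percentile_scale(resp_map, lo=0.5, hi=99.5):
    """Clip a response map to its [lo, hi] percentiles and rescale to [0, 1].

    A 0.5-99.5 percentile range keeps a few extreme values from
    compressing the color scale, unlike plain min-max scaling.
    """
    p_lo, p_hi = np.percentile(resp_map, [lo, hi])
    clipped = np.clip(resp_map, p_lo, p_hi)
    return (clipped - p_lo) / (p_hi - p_lo)

# Toy map with a single large outlier: min-max scaling would squeeze
# all other values into a narrow band; percentile scaling does not.
m = np.concatenate([np.linspace(0.0, 3.0, 199), [100.0]])
scaled = percentile_scale(m)
```

After scaling, the bulk of the map spans most of the [0, 1] color range while the outlier simply saturates at 1.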

As the reviewer noted above, for any individual image there are sometimes hot spots (yellow/red) outside the face area (or where the face ought to be). We are not claiming that these neurons respond only to face stimuli; see for example Tsao et al., 2006, showing that face cells consistently respond to clocks and round fruits. Indeed, we also see consistent responses, particularly to round, tan objects such as cookies. See Author response image 1. It seems likely that at least some of these features drive these neurons because they share shape, color, or contrast features with faces. There may be other features that consistently drive these neurons across images. Relatively few studies have systematically tested tuning of face patch neurons to features embedded within naturalistic stimuli. This is certainly an interesting avenue of exploration, but beyond the scope of the current study. Here, we specifically study the effect of bodies positioned below face-selective neuron RFs on firing rate.

Author response image 1
Response maps from M1 to natural images without faces.

Hotspots are consistently found over circular shapes and/or above body-like structures.

Also, in Figures 5 and 6B please show maps for the images, in addition to the PSTHs.

We now show response maps for each stimulus in Figures 6 and 7B (previously Figures 5 and 6B). See Figure 6—figure supplements 1 and 2 and Figure 7—figure supplement 2.

2) Differences in activity between different recording sites need to be discussed. A related question is the definition of non-face regions. This reviewer makes specific suggestions concerning Figure S6, which you may find helpful in dealing with the problem.

The response selectivity across the arrays in both animals was consistent. In Figure 1—figure supplement 4, we provided a measure of the similarity (Pearson correlations) between each channel and the n-1 population average for responses to natural images with faces. We now provide examples of individual channels for the population-level response maps shown in Figure 3 (Figure 3—figure supplement 3), Figure 4 (Figure 4—figure supplement 3), and Figure 5 (Figure 5—figure supplement 3). To better convey the consistency of responses across channels, we now also report the average Pearson correlation between each channel and the n-1 population average for data in Figures 3–5.

We did not see mixed selectivity in any of the arrays. We now provide indices for bodies vs. objects and faces vs. all other images for each responsive channel in all 3 recording sites in Figure 1—figure supplement 3. We find little to no body selectivity. For the experiments reported in Figures 3-5, we also now provide the mean correlation and variance between the spatial pattern of response maps for each channel and the n-1 group average within the ML arrays.

3) Please discuss responses to the legs for inverted bodies shown in Figure 6 and whether these multi-unit responses reflect mixed selectivity with the recorded patch. Showing responses of well-isolated units would help addressing this possibility.

We show in Figure 1—figure supplement 3 that these neurons responded to faces, but did not respond to bodies, when the bodies were presented within the receptive field. We find no evidence of mixed selectivity in our multi-unit RSVP data (e.g., strong body or hand selectivity). Reviewer #2 raised the concern that our multi-unit data comprises a predominant pool of face-selective neurons and a weaker pool of body-selective neurons. Such an interpretation is not consistent with our results. First, images of bodies did not significantly activate these neurons when presented within the RF (Figure 1—figure supplement 3). We now provide body-selectivity indices showing that there was little to no body selectivity in these arrays. Across our experiments, hot spots of activity are consistently above the body, not on it. A mixed-selectivity explanation would need the RFs of face-selective neurons to be systematically right above the RFs of body-selective neurons in both monkeys. However, such an account is not consistent with our results reported in Figure 6 (also see new Figure 6—figure supplements 1 and 2 for response maps for every image in this experiment). We show that the response to an intact face alone or above a body is nearly identical in magnitude. If the face and body responses came from separate neuronal populations, these responses would have to summate, and the face + body response would need to be larger than the response to the face without the body. Therefore, the most parsimonious explanation is that our multi-unit data comprised face-selective neurons that respond to a face presented within the RF or to a body presented below the RF. Upon consideration of the reviewers' comments, we agree that the link to end-stopping was too speculative. In the Results, we now limit our interpretation to the suggestion that the response to feet for inverted headless bodies may be a manifestation of this body-below responsiveness.

4) Long latencies of responses to occluded head images cast doubt on the interpretation proposed in the Discussion that rules out top-down explanation of the effect. In addition, the proposed role of experience is hypothetical and should be presented as such.

We understand the reviewer's concern regarding longer latencies. We do not agree that long latencies cast doubt on our interpretation, nor that they necessarily suggest a top-down explanation. In fact, the latencies of these neurons can vary simply by changing the size of a face or adding/removing backgrounds. To better illustrate the dependency of latencies on the visual input, we now include two new figures (Figure 2 and Figure 2—figure supplement 1) showing the response latencies of PL and ML recorded simultaneously in Monkey 1 to natural face images with and without eyes. By simply removing the eyes from faces, we see a substantial shift in the latencies for both PL and ML. The ML response latencies for these images without eyes are comparable to the latencies for the masked and occluded faces. Importantly, PL still responds to the eyeless faces with a shorter latency than ML. We do not think this suggests top-down modulation, but rather that these neurons are sensitive to the input of a variety of visual features that can manifest at different timescales. For example, we used a stimulus with parts of a face spatially separated (top of Figure 2). For both PL and ML, the response to eyes emerged in time windows earlier than the response to the surrounding face. This is consistent with prior findings in PL by Issa and DiCarlo (JN 2012; Figure 12).

In the Discussion, we propose that contextual representations, such as a body below a face, can manifest locally within IT through experience. We do not claim that we have shown that experience is necessary, but we think it is quite plausible. To support this interpretation, we now provide an analysis of category representations using an artificial neural network trained on image classification (Figure 8). We show that representations of faces and bodies in an artificial neural network (AlexNet) trained to classify images into a set of 1000 categories (but not explicitly trained on face, body, or human categorization!) start off separate and move closer to one another in deep layers of the architecture, more so than to other categories. We show this with multidimensional scaling visualizations of the distances between images for activations in early and deep layers of AlexNet (Figure 8A). We also show that the overlap in face and body representations is greatest for deep layers (relu6 and relu7) of AlexNet (Figure 8B). This provides some evidence that the regularity of experiencing faces and bodies is sufficient for overlapping representations to emerge in a hierarchical system.
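The distance analysis described here reduces to comparing mean pairwise Euclidean distances on a layer's activation vectors, within versus between categories. Below is a minimal sketch with synthetic activations standing in for real AlexNet features; the trained-versus-untrained comparison the reviewer requests amounts to running the same computation on both sets of activations:

```python
import numpy as np

def within_between_distances(features, labels):
    """Mean pairwise Euclidean distance within vs. between categories.

    features: (n_images, n_units) activations from one network layer.
    labels:   (n_images,) category label per image.
    """
    diffs = features[:, None, :] - features[None, :, :]
    d = np.sqrt((diffs ** 2).sum(-1))            # pairwise distance matrix
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)                # exclude self-distances
    diff = labels[:, None] != labels[None, :]
    return d[same].mean(), d[diff].mean()

# Synthetic stand-ins for layer activations: "face" and "body" images
# drawn around category-specific means (a real analysis would use
# AlexNet activations for trained and random-weight networks).
rng = np.random.default_rng(1)
faces = rng.standard_normal((20, 64)) + 2.0
bodies = rng.standard_normal((20, 64)) - 2.0
features = np.vstack([faces, bodies])
labels = np.array([0] * 20 + [1] * 20)

w, b = within_between_distances(features, labels)
# w < b: images are more homogeneous within than between categories.
```

Tracking how the face-body between-category distance shrinks (or not) across layers, relative to other category pairs, is the quantity of interest here.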

5) This reviewer also made several minor points that should be addressed in the body of the paper or in the Discussion. These include, clarifications of mapping RFs with dots and the possibility that their size was underestimated, dealing with negative response values when computing the index, enlarging the size of Figures 4 and S6 for better visibility and correction of the legend for Figure 4.

We have clarified the issue of RF mapping above and addressed the concern about RF size.

There were no channels with negative (below baseline) responses to faces or non-face objects in the PL array in Monkey 1. There were no negative face or non-face object responses in ML of either monkey but one channel in each of the ML arrays had a below-baseline response to bodies, so those channels were not used in calculating the average body vs. object indices. We clarify this in the results and Figure 1—figure supplement 3’s legend.

6) Please comment on the validity of using parametric t-tests to evaluate the effects and whether the assumptions underlying these tests are fulfilled.

We acknowledge that a relatively small number of observations (images) went into each test, which can be problematic for parametric tests. To alleviate the reviewer's concerns, we re-ran all the statistical analyses using non-parametric Wilcoxon tests. The results were nearly identical and did not change any interpretation of the results. We provide in Author response image 2 an illustration of the parametric vs. non-parametric results (both thresholded at p < 0.05, FDR-corrected) for all PSTH graphs with significant results.

Author response image 2
Comparison of t-test results with non-parametric Wilcoxon signed-rank test.

Both thresholded at p < 0.05, FDR-corrected.
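The parametric/non-parametric comparison described above pairs a t-test with a Wilcoxon signed-rank test on the same data. A self-contained sketch, with both statistics implemented directly (normal approximation for the Wilcoxon, no tie correction) and hypothetical response values:

```python
import numpy as np

def paired_t(x, y):
    """Paired t-statistic (parametric)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

def wilcoxon_z(x, y):
    """Wilcoxon signed-rank statistic, normal approximation (non-parametric).

    Ranks |differences|, sums ranks of positive differences, and
    standardizes against the null mean/variance. Zero differences are
    dropped; ties in |differences| are ranked arbitrarily here.
    """
    d = np.asarray(x, float) - np.asarray(y, float)
    d = d[d != 0]
    n = len(d)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1   # ranks 1..n
    w_pos = ranks[d > 0].sum()
    mu = n * (n + 1) / 4
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_pos - mu) / sigma

# Hypothetical net responses: face region vs. non-face region per image.
face = np.array([30.0, 25.0, 40.0, 22.0, 35.0, 28.0, 31.0, 27.0])
nonface = np.array([5.0, 9.0, 7.0, 6.0, 10.0, 4.0, 8.0, 5.0])
t_stat = paired_t(face, nonface)
z_stat = wilcoxon_z(face, nonface)
# Both statistics point the same way, mirroring the authors' report
# that parametric and non-parametric results were nearly identical.
```

In practice one would use a library routine with proper tie and continuity corrections; the point here is only that both tests operate on the same paired differences.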

Reviewer #1:

This study demonstrates that neurons in two macaque face patches, ML and PL, respond in a retinotopically specific way to face-related features of a scene. Surprisingly, these face-related responses are often elicited even when the basic structural features of a face are occluded, such that only the bodily cues indirectly indicate the presence of a face in the neuron's response field.

1) This is a very exciting study that makes two important points. The first, which is not featured, perhaps because it is "sort of" known, is that for neurons in these face patches the animal's fixation position is extremely important. Responses to faces or "unseen" faces are wholly contingent on the face being present on a limited portion of the retina. I thought this aspect of the study is very important for a community that does not typically think about "face cells" in this way. The second aspect, which is featured more strongly, is that neurons respond in this retinotopically specific way to occluded faces. This is quite a remarkable finding. The authors present some interesting extensions and controls, including showing a surprising lack of influence of adding the body to the visible face. This study provides much food for thought.

We thank the reviewer for this encouraging assessment.

2) My main criticism of the paper is that I don't have a clear picture of how representative the featured effect is. Figures 2 and 3 show heat maps, and refer to "maps of cells". Are these population maps (as in the bottom panels of Figure 1—figure supplement 3), or are they selected responses of individual cells? There is a statement early on that everything is population analysis, but the authors need to do a better job explaining how many cells go into each of the plots, since the reader needs to understand to what extent this behavior is typical of neurons in ML and PL. Having only one PL recording site also significantly weakens this aspect.

We apologize for this confusion. All main text figures show population data from all visually responsive sites in each array. Figure 1—figure supplement 4 showed a few examples of response maps from a single channel for comparison with the population map for Monkeys 1 and 2. We focus on the population data because selectivity in the RSVP paradigm (Figure 1—figure supplement 3) was similar across channels within each array, and the response maps of individual channels to complex images containing faces were, on average, highly similar to the n-1 group average; the population response maps therefore represent activity that was consistent across individual channels.

To give a better sense of the individual channel variability, we now include graphs of body- and face-selective indices for each array (Figure 1—figure supplement 3) and include new supplemental figures for Figures 3-5 (previously Figures 2-4) showing response maps for 3 individual channels each for both Monkeys 1 and 2. We also now report the similarity of response maps across channels within each array for the experiments in Figures 3-5.

We agree that only having a single PL recording array from one monkey limits what we can conclude about PL in general. Overall face-selectivity and the responses to occluded faces were weaker in PL so we focused on ML. We still feel that it is useful to report our PL results here, given that in this monkey we recorded simultaneously from PL and ML (which to our knowledge has not been reported previously). To make better use of the PL data, we now include a figure illustrating the latency differences between PL and ML from simultaneous recordings. For each natural image, the face response to PL preceded ML. Interestingly, when we occluded the eyes, the latencies for both PL and ML increased, but PL face responses still preceded ML. Though PL showed only weak responses to masked and occluded faces, we still saw a similar latency difference (Figure 2—figure supplement 2). We take this as additional evidence that the body-facilitation effect is not due to top-down feedback. Implanting a PL array in Monkey 2 is not an option at this time, so adding another PL monkey now would require training up a third monkey, targeting, and implanting PL, which would take months.

3) Relatedly, since neurons differ in their specific preferences, it's surprising that there is minimal discussion of this. The authors should do more to provide information about how stimulus selectivity (aside from positional selectivity) interacts with their main findings.

We did not see much heterogeneity in tuning across each of our arrays. We attribute this homogeneity within each array to targeting face patches using fMRI. In Figure 1—figure supplement 3, we show, using an independent stimulus set of isolated objects, that all visually responsive channels selectively responded to faces (as compared to bodies, hands, and objects). We now provide body- and face-selectivity indices for each channel. The degree of face selectivity varied across channels, but all channels were much more responsive to faces than to any other category. These neurons may vary in their preference for particular facial features. We did not test for this, though all channels were particularly responsive to eyes. There was some variability in the specific tuning; e.g., some face-selective channels in Monkey 2 had a small preference for human faces over monkey faces, while other channels in Monkey 2 preferred the opposite. We did not thoroughly test other dimensions that face cells are known to vary along, such as viewpoint preference, and our complex stimuli did not substantially vary along this dimension. To better illustrate the similarity in response selectivity across channels, we now show response maps for 3 individual channels for Figures 3-5 (previously Figures 2-4).

Reviewer #2:

[…]

1) The results are not as clear-cut as the impression one gets from reading the text of the paper. Examples of such issues are the following. Figure 2: Monkey 1 ML: second column image (human carrying a box): in the occluded condition there is also strong activation in the top right corner where the box is. In Figure 3, third and fifth column: the location of the masked face and the hotspot of neural activity do not match well in the data of M2. Furthermore, for the 7th column image of Figure 3 (3 persons carrying boxes) there is a clear hotspot in both monkeys only for the rightmost (third) person: why not for the closer other two persons? In the 9th column image, there is an additional hotspot in the data of M2 on the body below the covered head. Also striking is the variability in effects between images and monkeys. This is apparent in the data of Figure 4: second column (little response for ball covering the head in M1), fourth column (no or little response to non-face object on face), fifth column (no response to covered face in M1). Also, in general the data of M2 appear to be more "smeared out" compared to those of M1: is that related to RF size differences, neuron-to-neuron variability or the mapping with larger steps in M2? The authors should discuss these discrepancies between the data and what they describe in the text in detail, acknowledging that these cells do not always respond at the location of the occluded head or sometimes even not close to the occluded head.

The reviewer is correct that responses were variable across individual stimuli. That is to be expected from recordings in IT cortex. Critically, as we show for each and every experiment, across stimuli responses to faces, or where the face ought to be, are on average stronger than to non-face parts of these images. For each experiment, we show averaged PSTHs along with confidence intervals showing the variability across images and perform statistical testing to convey the reliability of these responses.

As the reviewer noted, it is interesting that for images with multiple faces, some hot spots were more prominent than others. We have been testing the effect of clutter on face responses more directly in a separate set of experiments, but this is beyond the scope of the current study.

Regarding Monkey 2’s activity maps: we note in the Materials and methods, and in describing the data reported in Figure 1, that the step size used for mapping Monkey 2’s responses was double that of Monkey 1. This was a limitation imposed by the length of time Monkey 2 would work on a given day. Importantly, the spatial spread of the response maps due to this larger step size did not preclude finding statistically significant responses to bodies below the receptive field, as reported for all seven experiments. As shown in Figure 1—figure supplement 4, the response maps were similar across channels in Monkey 2’s array, so it is unlikely that channel variability had a major smearing effect on the response maps. The receptive field sizes across Monkey 2’s ML array were similar to Monkey 1’s (Figure 1—figure supplement 2). For large RFs, we would certainly expect more spatial spread. If anything, large RFs would work against the spatially specific effects we find.

We agree with the reviewer that this variability across individual stimulus response maps should be acknowledged. In each section we now include the following clarification:

For Figure 3:

“The magnitude and timing of responses to occluded faces varied across stimuli. Across all stimuli, the population level activity to occluded faces was reliably higher than the average response to the rest of each image (t-test across 8 images; p < 0.05, FDR-corrected).”

For Figure 4:

“The population-level responses to masked faces varied in magnitude and timing but were consistently stronger than average responses to the rest of each image (t-test across 12 images; p < 0.05, FDR-corrected).”

For Figure 5:

“Similar to the occluded and masked faces, population-level responses to these face-swapped images varied in magnitude and latency but responses to non-face objects were consistently stronger when positioned above a body than when positioned elsewhere (t-test across 8 and 15 images for M1 and M2, respectively; p < 0.05, FDR corrected).”

One related problem concerning the data presentation is that the maps are scaled to the minimum and maximum response for each image. I wonder how the maps would look when shown across the full possible range of net responses for a site (i.e. from 0 net response to maximum response): was there no response at locations other than (or close to) the (occluded) head/face? Also, related to this point, the authors should show the maps for the images of Figures 5 and 6B – not just the PSTHs (see next main point). In other words, these interesting data should be presented in greater detail.

We address this comment in the Essential Revisions section above.

2) The authors show only averages across electrodes per animal (except for the responses to unoccluded faces (Figure S3)) and one wonders how strong the differences are between the different neuron sites. To remedy this, the authors can show the maps for each image in Figure S6 (but please make the images larger) so that the reader can assess the (or lack of) consistency of the responses for the different images.

We performed statistical analyses for each experiment to quantify the reliability of effects across test images.

Across all experiments we showed 222 images. Each of the 3 arrays recorded 32 channels. We do not think it is reasonable to provide a supplementary figure with 21,312 images (made large). For each experiment we show the mean responses and confidence intervals across all images and provide appropriate statistics to demonstrate reliable responses.

We now provide several additional supplemental figures to illustrate:

1) the variability of the response maps across time (Figure 1—figure supplement 5, Figure 2—figure supplement 1, Figure 3—figure supplement 1, Figure 3—figure supplement 2, Figure 4—figure supplement 1, Figure 4—figure supplement 2, Figure 5—figure supplement 1, Figure 5—figure supplement 2)

2) the variability across channels (Figure 3—figure supplement 3, Figure 4—figure supplement 3, Figure 5—figure supplement 3).

3) population-level response maps for every image reported in Figure 6 (Figure 6—figure supplement 1 and 2) and Figure 7B (Figure 7—figure supplement 2).

The PSTHs in the current figures are averages across the images so the presented confidence intervals give some indication of the variability for the different images. However, what was unclear to me is how the authors defined the non-face region? They should indicate the latter in Figure S6 and describe in detail how they define the non-face region: centered on an object, equated for contrast/luminance with the face/occluded face, same area, etc? Choosing a valid non-face region does not seem trivial to me for these complex images, and such a choice may affect the responses shown in the PSTHs.

For Figures 1, 3 and 4, non-face responses were defined as the mean response across all non-face parts of the image. These regions are indicated in grey in Figure 7—figure supplement 3. We acknowledge that the non-face region averages responses over a wide variety of visual features and that choosing a non-face region for comparison is not trivial. Because of this, we performed several additional experiments (reported in Figures 5, 6 and 7B) that provide well-controlled comparisons. For Figure 5, the response to non-face objects atop a body is compared to responses to the identical objects occurring at another part of the scene. For Figure 6, the PSTHs within each of the 4 face groups (intact, noise, outline, and object) were taken from the same region of the image and comprised identical pixels. The only aspect of the image that varied was the presence or absence of a body below this region of interest. For Figure 7B, the pixels for each image analyzed are identical. The only thing that varies is whether the image was upright or inverted.

Finally, the authors can compute for each site and image an index contrasting the net response to the face and control, non-face region and show the distribution of such indices for each experiment.

The consistency and timecourse of responses to face vs. non-face regions across images in Figures 1-4 is shown in the PSTHs and evaluated statistically. For the mapping experiments, we avoided making such indices because that would require averaging across a set time window, and the latencies were variable across stimuli and experiments. We now provide several additional figures that highlight the variable timecourses of responses (Figure 1—figure supplement 5, Figure 2, Figure 2—figure supplement 1, Figure 3—figure supplement 1, Figure 3—figure supplement 2, Figure 4—figure supplement 1, Figure 4—figure supplement 2, Figure 5—figure supplement 1, Figure 5—figure supplement 2). We feel that a more objective approach is to show the actual responses and provide statistics on the consistency of response across images at each millisecond bin.

3) In the experiment of Figure 5, the authors compare the responses to a modified face alone and the modified face on top of a body and find stronger responses when the body is present. However, they should have as a control condition also the body without face, otherwise the increased response could have been due to a (weak) response to the body itself (summation of obscured face and body response). The absence of such summation effect for full faces could be due to a ceiling effect, as suggested by the authors.

The point of this study is that face-selective cells do give a weak response to bodies, but only when a body is presented below the receptive-field center. We do show responses to headless bodies in Figures 6 and 7. The experiment reported in Figure 6 (previously Figure 5) demonstrates that a body positioned below the RF can influence activity in the absence of an intact face regardless of what is within the RF – whether a face outline, face-like shape (noise), or even non-face object. The PSTHs comprise activity only within the face region above the body, not when the body is in the RF. The difference between each colored line in Figure 6 and the black line is sufficient to demonstrate the effect of a body being present below the RF. We now show responses to intact heads, face-shaped noise, face outlines, and objects with and without bodies (Figure 6—figure supplement 1). We also now show the response maps for each image in this experiment for both monkeys (Figure 6—figure supplements 1 and 2).

Whether these neurons would respond to a body presented by itself below the receptive field is an interesting question. We already show those effects in Figure 7A. Additionally, we now show response maps for several images we tested during a pilot experiment that include headless mannequins and headless bodies (Figure 6—figure supplement 3). In each image, increased activity is apparent above each body.

4) The strong responses to the legs for the inverted body in Figure 6 is surprising to me. It clearly shows that the responses of these neurons to body stimuli without a head are more complex than just a response to an occluded face or absent head. The authors suggest that this is related to responses to short contours in the face cells. But if that is the case, can it not explain also the responses to the rectangular shapes (with also short contours) occluding the head in the images of Figure 3? Also, when it is related to short contours, why not then comparable strong responses to the images with a head (upper row of Figure 6)? It is clear to me that we do not understand the responses to these neurons. Also, how consistent was this leg response across the different sites? It also makes one wonder whether some of these "weird" responses are because the recordings are multi-unit, which will pool the responses of different neurons, some of which may not be face selective. In other words, it could be that the single neurons that respond to the legs are not face selective at all but mixed with face selective neurons in this patch. It would have been informative to show the responses of well-isolated single units in this case (and also in the other experiments).

For the experiment reported in Figure 7A, the responses are focused on and in-between the feet, but only when the feet are above a body. While this response was consistent across stimuli in this experiment, the ML arrays were generally unresponsive to feet in the other experiments; e.g., the responses around the feet are weaker than above the body for the bowling image in Figure 5, the monkey’s foot (except when photoshopped over the body) in Figure 6—figure supplements 1 and 2, the people carrying wood in Figure 4, the monkeys in Figure 4, the man carrying boxes in Figure 3, and the man with the orangutan in Figure 1. As discussed in the Results, we think the responses to the feet in Figure 7A are likely due to the peculiarity that, for an inverted body, the feet are above a body. We showed in Figure 6 that essentially any object above a body will elicit a response. The monkey object-with-body image in Figure 6—figure supplement 2 is a particularly clear example: whatever the base firing rate to a leg is, positioning it over the body increases its firing rate.

More broadly, the reviewer is concerned whether our effects could be the result of mixed selectivity from multi-unit data. We address the concern of mixed selectivity in the Essential Revisions section above. That said, we agree with the reviewer that the responses of these neurons are more complex than simply responding to an intact (or even occluded) face. We see robust activity to non-face objects – for example, we see hot spots on objects that have features similar to faces e.g., round, concentric circles or the tops of body-like structures (see Author response image 1). This does not diminish our findings. Our experiments were specifically aimed at probing the influence of a body below these neurons’ RFs. In particular the experiments reported in Figures 5, 6, and 7B target the effect of placing a body below the RF regardless of tuning properties to other visual features.

5) The authors suggest in the Discussion that the responses to the occluded head are not due merely to expectation. However, the latencies of the responses to the occluded head appear to be longer than to a real face (and, in some cases, the responses were (as for the non-face) preceded by inhibition (Figure 2, 3, 4 and 5 in M1)). Thus I feel one cannot rule out an expectation or top-down explanation of the effects.

An expectation or top-down account is hard to reconcile with our data. Stimuli were shown for only 100 ms. Though latencies to occluded faces were longer than to an intact face, preferential responses to occluded faces were statistically significant by ~120-140 ms, which seems too short for ‘top-down expectation’. Instead, we have now added to the Discussion the idea that such ‘expectation’ could be broad enough to include the possibility that the inputs to face cells become sculpted by experience to respond to things that are commonly seen with faces. Furthermore, the latency differences between PL and ML are also inconsistent with a strong top-down account. We now provide a new Figure 2 that demonstrates how the latencies vary in both PL and ML depending on the features present in the image, but PL responses to face features always precede ML responses. The fact that the eyeless faces have longer latencies (Figure 2) indicates these neurons likely have complex spatiotemporal RFs that are not driven by top-down expectation. We also now provide an analysis of face and body images using a hierarchical artificial network trained on image classification. We find that the representations of human faces and human bodies in deeper layers are more overlapping than the representations of faces and other categories (Figure 8). This network was not explicitly trained on faces, bodies, or humans and thus demonstrates that representations of faces and bodies can merge over learning without any explicit training about faces and bodies.

To further explain our reasoning, we now provide the following Discussion:

“The idea that the context of a visual scene leads to the expectation, enhanced probability, or prior, of seeing some particular item is usually thought to involve top-down processes. However, if the feedforward inputs to, say, face-selective neurons, are sculpted by experience to be responsive not only to faces, but also to things frequently experienced in conjunction with faces, then such expectations would be manifest as the kind of complex response properties reported here.”

6) That the responses to the occluded heads result from experience makes sense but is speculation and not demonstrated in the present experiments. This should be made explicit.

We have softened our Discussion to make it clearer that our results are consistent with an interpretation that this contextual coding is dependent on experience without claiming that we have tested this. To bolster this hypothesis, we provide modeling results demonstrating how face and body representations can merge in a hierarchical system from statistical regularities of experience (even when not explicitly trained on those features).

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Reviewer #2:

The authors performed an extensive revision, with new interesting data and modelling. They responded to all my concerns and overall I am happy with their revision. Nonetheless, I still have a couple of concerns that should be addressed by the authors.

1) I understand that it is not possible to show the responses of all units to all stimuli, given the size of the data. I asked to show distributions of indices in order to get an idea about the across-image and across-unit variability, but the authors argue that this is not practical because of the variability in response latency amongst images and units. Such an argument can always be made, and then one should refrain from using indices (but the authors compute face and body indices). I think that when one uses net responses to compute indices, the variability in latency should not be a main cause of variance among indices of different neurons or images. One can employ an analysis window that captures the population response, computed across images and neurons, per experiment. The confidence intervals that are provided now depend on the number of observations; although they show the consistency of the mean population response, the CIs are not a straightforward metric of the variability among sites or images. Also, it is not clear whether the individual site examples shown in the figure supplements are representative of the observed diversity of response profiles. Hence, I still believe that distributions of indices will help to appreciate the variability among images and sites.

We now provide response histograms for both monkeys in Figures 3-6. We use the same time window as used for the response maps for each monkey. We show the distribution of responses across channels for individual images as well as the distribution of channel responses for the mean across images. Each graph shows that most channels respond more strongly to the region above the body than to comparison regions. The only condition in which we did not see this effect was the face-with-body vs. face-without-body comparison in Figure 6 (leftmost graph), which, as previously discussed, did not show a significant difference in the population responses across images.

2) The authors provide novel data on the responses to images in which the eye is invisible. However, in the images shown in Figure 2, either more than the eyes are removed (top image) and it is unclear how these other changes affected the response to the disfigured face, or, the eyes are occluded by a visible occluder (bottom image). The latter is known to increase the response latency of IT neurons. To assess whether removing the eye from the image affects the response, it would have been better to present a face without eyes, the eye region being filled in with the neighboring texture, so that no additional changes (holes or bar occluder) are present. With the current images, I believe the authors should be more careful in interpreting the response and latency differences between the full face and the images in which the eyes were removed or occluded.

The reviewer raises a good point that in Figure 2, the images either had more than the eye removed or are presented with a visible occluder. We agree that the particular manner in which the eyes are removed from an image could affect neural responses. In Figure 2—figure supplement 1, we showed as the reviewer requested, an image in which the eye region was filled in with the neighboring texture (third image from the top). We took the face texture from sides of the monkey face and filled in the eye region. A similar shift in the latencies is evident in this image. We modified the main text to directly address this:

“A similar shift in response latencies was evident when replacing the eyes with a visible occluder, the background of the image, or the surrounding texture of the face (Figure 2—figure supplement 1).”

3) The authors suggest that the responses to the occluded faces do not result from top-down processing, using as argument that the response latency differences between ML and PL remain for the occluded faces, which is a reasonable argument. However, PL and ML responses are compared directly only for the intact faces and the "invisible eye" images, but not for the face occlusion conditions. One way to show this is to compare in one figure the population PSTHs between ML and PL for the different conditions.

While we do not plot the responses to PL and ML on the same graph for face occlusion conditions, the difference in responses is apparent when comparing the graphs in Figures 3-5 to Figure 4—figure supplement 4 where responses occur prior to 100ms in PL and after 100ms in ML. We include as Author response image 3 a direct comparison of these PL and ML responses in monkey 1. Darker responses are ML and fainter responses are PL.

Author response image 3

4) They write, when describing the novel and interesting Alexnet simulations, that "The Euclidean distances between image pairs within each category were smaller than between categories, indicating that images were more homogeneous within than between categories and highlighting that simple shape is a major component of these commonly tested "semantic" categories". However, the images also differ in texture and it is possible that not only shape but also texture is driving the category differences in Alexnet. In fact, some recent studies stressed the importance of texture cues for categorization in deep convolutional neural networks.

We agree with the reviewer that texture may also play an important role. We have modified the text to include reference to texture differences:

“The Euclidean distances between image pairs within each category were smaller than between categories, indicating that images were more homogeneous within than between categories and highlighting that low-level features like shape, color, or texture are a major component of these commonly tested “semantic” categories (also see Dailey and Cottrell, 1999).”
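The within- vs. between-category distance comparison described in this passage can be sketched as follows (illustrative only; `feats` stands for any layer's activation vectors, and the helper name is ours):

```python
import numpy as np
from itertools import combinations

def within_between(feats, labels):
    """Mean Euclidean distance between image pairs with the same vs.
    different category labels."""
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        d = float(np.linalg.norm(feats[i] - feats[j]))
        (within if labels[i] == labels[j] else between).append(d)
    return float(np.mean(within)), float(np.mean(between))
```

A smaller within- than between-category mean distance is what the quoted sentence reports.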

5) The authors employ the Alexnet simulation to demonstrate that representations of faces and bodies can merge over learning. However, to show that learning is critical they should perform a similar analysis of the category representations in an untrained Alexnet, e.g. with random weights.

As requested, we now show that the similarity of faces and bodies in AlexNet is indeed specific to the trained network and not present in untrained networks. We simulated untrained networks by shuffling the weights of the trained AlexNet within each layer. This preserves the distribution of weights and allows for more direct comparisons with the trained network. We calculated image-pair correlations on 500 permutations of the shuffled weights. In a new supplementary figure (Figure 8—figure supplement 1), we show that the mean and 97.5th percentiles of face-body correlations across permutations are below the face-body correlations in the trained network. Also, these correlations were no different than the correlations between faces and other categories (black vs. grey shaded regions in Figure 8—figure supplement 1).
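The shuffling control can be sketched as follows (a toy version with a single fully connected ReLU layer and random inputs; the actual analysis shuffled each AlexNet layer's weights and used real face and body images):

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_weights(w, rng):
    """Permute a layer's weights, preserving their overall distribution."""
    flat = w.ravel().copy()
    rng.shuffle(flat)
    return flat.reshape(w.shape)

def pair_corr(x1, x2, w):
    """Correlation between two inputs' ReLU feature vectors under weights w."""
    f1, f2 = np.maximum(x1 @ w, 0.0), np.maximum(x2 @ w, 0.0)
    return float(np.corrcoef(f1, f2)[0, 1])

# Null distribution of "face-body" correlations under shuffled weights
# (toy random inputs standing in for face and body images).
w = rng.normal(size=(50, 100))
face, body = rng.normal(size=50), rng.normal(size=50)
null = [pair_corr(face, body, shuffle_weights(w, rng)) for _ in range(500)]
upper = np.percentile(null, 97.5)   # compare the trained correlation to this bound
```

Because shuffling only permutes values, the shuffled layer has exactly the same weight distribution as the trained one, which is the point of this control.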

6) The authors write "that convolutional networks can capture the representational structure of biological networks". It all depends on what is meant by "capture" here, but there is still a discrepancy between IT and classical feedforward deep net (like Alexnet) representations, e.g. see recent work by DiCarlo's group. I suggest to tone down this statement on the correspondence between deep nets and ventral stream representations.

We agree with the reviewer that convolutional nets such as AlexNet differ from the ventral stream in fundamental ways and that it would be erroneous to equate the two. We did not intend to draw such strong equivalencies. The second half of the quoted sentence highlights some of the ways in which the two differ. We agree with the reviewer that differences in representational structure have also been shown between deep nets and IT. We simply intended to convey that deep nets such as AlexNet capture some of the representational structure, such as these differences between image categories, which allowed us to demonstrate the effect of the statistics of experience on representations. This is consistent with DiCarlo’s (and others’) work. As the reviewer suggested, we have toned down this statement to avoid confusion:

“Though convolutional networks can capture some aspects of the representational structure of biological networks (Hong et al., 2016; Ponce et al., 2019),”

7) Can the authors explain why they permute the labels of the non-face images when computing the between face, non-face correlations (inset; Figure 8B); permutation of the non-face labels in this case should not make a difference, or perhaps I am missing something here.

Permuting all labels of the non-face images, including bodies and non-bodies, tests whether the correlations between faces and bodies belong to the same distribution as the correlations between faces and other non-face, non-body images. Because the face-body correlations are stronger than 97.5% of the permuted correlations, the representations of faces and bodies overlap more than those of faces and other non-face, non-body images.
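
The logic of this label permutation can be sketched as follows; the correlation values and category sizes here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical mean correlation-with-faces for each non-face image.
labels = np.array(['body'] * 20 + ['other'] * 80)
corr_with_faces = rng.normal(0.2, 0.1, size=labels.size)
corr_with_faces[labels == 'body'] += 0.15  # assume bodies correlate more strongly

observed = corr_with_faces[labels == 'body'].mean()

# Null: permute ALL non-face labels (bodies and non-bodies alike), so the
# 'body' group becomes a random draw from the pooled non-face images.
null = [corr_with_faces[rng.permutation(labels) == 'body'].mean()
        for _ in range(1000)]

# If the observed face-body correlation exceeds the 97.5th percentile of the
# null, face-body similarity is not explained by general non-face similarity.
exceeds = observed > np.percentile(null, 97.5)
```

Shuffling only the non-body labels would leave the body group fixed and so could not generate this null; permuting all non-face labels is what makes the test meaningful.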

8) The authors should provide the explained variance of the 2D MDS configuration shown in their novel Figure 8A, so that the reader can appreciate how the distances plotted in 2D correspond to the original distances in the high-dimensional feature space.

We have added the variance explained to each of the three MDS plots. The first two dimensions capture the majority of the variance for pixel intensity and relu7 (72% and 74%, respectively), whereas the variance explained by the first two dimensions was substantially lower for relu1 (39%) and increased toward deeper layers. We use MDS only as a visualization; importantly, the correlation analyses reported in Figure 8B were performed on the full-dimensional feature spaces. We include as Author response image 4 the variance explained by the 2D MDS across all relu layers.

Author response image 4
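
For readers who wish to reproduce such numbers, variance explained by a low-dimensional embedding can be computed from the eigenvalues of classical (Torgerson) MDS; a minimal sketch on synthetic data (the feature matrix here stands in for a layer's activations and is not the actual data):

```python
import numpy as np

def mds_variance_explained(X, n_components=2):
    # Classical (Torgerson) MDS: double-center the squared-distance matrix
    # and take the ratio of the top eigenvalues to the positive-eigenvalue sum.
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    eigvals = np.sort(np.linalg.eigvalsh(B))[::-1]   # descending
    pos = np.clip(eigvals, 0.0, None)                # keep positive part only
    return pos[:n_components].sum() / pos.sum()

# Synthetic stand-in for one layer's feature vectors (images x features).
rng = np.random.default_rng(2)
feats = rng.normal(size=(50, 10))
var2d = mds_variance_explained(feats, 2)  # share of variance in a 2D embedding
```

Non-metric MDS implementations report stress rather than eigenvalues, so this eigenvalue-ratio form applies to the classical (metric) variant.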

9) The authors write in their Discussion "More broadly, these results highlight.… that the facilitation of face-selective neurons in the absence of a clear face reflects a lifetime of experience that bodies are usually accompanied by faces." As I noted before, there is no evidence yet that the contextual effects reported by the authors do result from experience during the lifetime. Please rephrase.

We have modified the text to more clearly indicate that we infer this:

“More broadly, these results highlight that neurons in inferotemporal cortex do not code objects and complex shapes in isolation. We propose that these neurons are sensitive to the statistical regularities of their cumulative experience and that the facilitation of face-selective neurons in the absence of a clear face reflects a lifetime of experience that bodies are usually accompanied by faces.”

10) Figure 1—figure supplement 1: the authors should indicate the t-values (with the corrected p = 0.05 threshold) of the face-object contrast in the fMRI maps.

We now include the t and p values for the fMRI maps:

“Data threshold at t(2480) > 3.91 (p < 0.0001 FDR-corrected) and t(1460) > 3.91 (p < 0.0001 FDR-corrected) for monkeys 1 and 2, respectively.”

11) Figure 1—figure supplement 3: can the authors explain the above-baseline responses in M1, ML before 50 ms after stimulus onset – one would expect that soon after stimulus onset responses equal to baseline because of the response latency of the neurons.

We thank the reviewer for catching this. The responses were baseline-subtracted, but the axis labels were incorrect. We have fixed the graph accordingly and confirmed that the other response graphs are labeled correctly.

https://doi.org/10.7554/eLife.53798.sa2

Article and author information

Author details

  1. Michael J Arcaro

    Department of Psychology, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    marcaro@sas.upenn.edu
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-4612-9921
  2. Carlos Ponce

    Department of Neuroscience, Washington University in St. Louis, St. Louis, United States
    Contribution
    Conceptualization, Software, Visualization, Methodology
    Competing interests
    No competing interests declared
  3. Margaret Livingstone

    Department of Neurobiology, Harvard Medical School, Boston, United States
    Contribution
    Conceptualization, Resources, Data curation, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared

Funding

National Institutes of Health (RO1 EY 16187)

  • Margaret Livingstone

National Institutes of Health (P30 EY 12196)

  • Margaret Livingstone

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank A Schapiro for advice on analyses and helpful comments on the manuscript. This work was supported by NIH grants RO1 EY 16187 and P30 EY 12196.

Ethics

Animal experimentation: All procedures were approved in protocol (1146) by the Harvard Medical School Institutional Animal Care and Use Committee (IACUC), following the Guide for the care and use of laboratory animals (Ed 8). This paper conforms to the ARRIVE Guidelines checklist.

Senior Editor

  1. Timothy E Behrens, University of Oxford, United Kingdom

Reviewing Editor

  1. Tatiana Pasternak, University of Rochester, United States

Publication history

  1. Received: November 20, 2019
  2. Accepted: May 21, 2020
  3. Version of Record published: June 10, 2020 (version 1)

Copyright

© 2020, Arcaro et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


