1. Computational and Systems Biology

The bottom-up and top-down processing of faces in the human occipitotemporal cortex

  1. Xiaoxu Fan
  2. Fan Wang
  3. Hanyu Shao
  4. Peng Zhang
  5. Sheng He (corresponding author)
  1. Institute of Biophysics, Chinese Academy of Sciences, China
  2. University of Chinese Academy of Sciences, China
  3. University of Minnesota, United States
Research Article
Cite this article as: eLife 2020;9:e48764 doi: 10.7554/eLife.48764

Abstract

Although face processing has been studied extensively, the dynamics of how face-selective cortical areas are engaged remains unclear. Here, we uncovered the timing of activation in core face-selective regions using functional Magnetic Resonance Imaging and Magnetoencephalography in humans. Processing of normal faces started in the posterior occipital areas and then proceeded to anterior regions. This bottom-up processing sequence was also observed even when internal facial features were misarranged. However, processing of two-tone Mooney faces lacking explicit prototypical facial features engaged top-down projection from the right posterior fusiform face area to right occipital face area. Further, face-specific responses elicited by contextual cues alone emerged simultaneously in the right ventral face-selective regions, suggesting parallel contextual facilitation. Together, our findings chronicle the precise timing of bottom-up, top-down, as well as context-facilitated processing sequences in the occipital-temporal face network, highlighting the importance of the top-down operations especially when faced with incomplete or ambiguous input.

Introduction

There is ample evidence that the processing of face information involves a distributed neural network of face-sensitive areas in the occipitotemporal cortex and beyond (Duchaine and Yovel, 2015; Haxby et al., 2000). Three bilateral face-selective areas are considered the core face-processing system, defined in functional Magnetic Resonance Imaging (fMRI) studies as regions showing significantly higher responses to faces than to objects: the Occipital Face Area (OFA) in the inferior occipital gyrus (Gauthier et al., 2000; Haxby et al., 1999), the Fusiform Face Area (FFA) in the fusiform gyrus (Kanwisher et al., 1997; Grill-Spector et al., 2004) and a face-sensitive area in the posterior superior temporal sulcus (pSTS) (Hoffman and Haxby, 2000; Puce et al., 1998). Similarly, a number of so-called face patches have been identified in macaque monkeys along the superior temporal sulcus (Tsao et al., 2003; Tsao et al., 2006; Tsao et al., 2008). Although the functional properties of these areas have been studied extensively, we do not yet have a comprehensive understanding of how the face-processing network functions in a dynamic manner. Hierarchical models postulate that face-specific processes are initiated in the OFA based on local facial features, and that the information is then forwarded to higher level regions, such as the FFA, for holistic processing (Haxby et al., 2000; Fairhall and Ishai, 2007; Liu et al., 2002). This model is supported by neuroimaging studies showing functional properties of face-selective areas and is consistent with generic local-to-global views of object processing. However, it has been challenged by results from studies in which patients with a damaged OFA could still show FFA activation to faces (Rossion et al., 2003; Steeves et al., 2006).
Further, it was reported that during the perception of faces with minimal local facial features, FFA could still show face-preferential activation without face-selective inputs from OFA (Rossion et al., 2011). Thus a non-hierarchical model was proposed postulating that face detection is initiated at the FFA followed by a fine analysis in the OFA (Rossion et al., 2011; Gentile et al., 2017). These competing models may reflect different modes of operation of the face network under different demands. To reconcile these models, a comprehensive dynamic picture of face processing under different conditions with more detailed temporal information is needed.

In the current study, we investigated the dynamics of face processing in the ‘core face processing system’ using Magnetoencephalography (MEG) and fMRI. We designed the face-related stimuli specifically to reveal mechanisms for processing 1) normal faces, 2) Mooney faces with very few explicit facial features, 3) distorted faces with internal facial features spatially misarranged, and 4) contextually induced face representations with internal facial features completely missing. During the experiment, subjects were presented with various types of face pictures while MEG signals were recorded. The key effort in this study was reconstructing the source signals from the MEG sensor data, to obtain a dynamic depiction of cortical responses to faces and other types of stimuli. With the timing of activation revealed in each face-selective area in the ‘core face processing system’, we could uncover when and where face information is processed in the human brain.

The main findings are briefly summarized here. First, we revealed the basic, mainly bottom-up, processing sequence along ventral temporal cortex by presenting face pictures of famous individuals to subjects. Face processing was initiated in the posterior areas and then proceeded forward to anterior regions. Right OFA (rOFA) and right posterior FFA (rpFFA) were activated very close in time, peaking around 120 ms, while right anterior FFA (raFFA) reached its peak at about 150 ms. The right pSTS (rpSTS) in the dorsal pathway showed a weaker and temporally more variable response, participating in face processing within a time window from 130 to 180 ms. Then, we highlighted the top-down operation in face processing by using two-tone Mooney face images (Mooney, 1957) lacking prototypical local facial features. According to the predictive coding theory (Rao and Ballard, 1999; Murray et al., 2004; Mumford, 1992), face prediction created at FFA based on impoverished information of Mooney faces and prior knowledge is poorly matched with the input representation at OFA due to the lack of explicit local facial features. The activity in OFA, representing ‘residual error’ between top-down prediction and bottom-up input, is then expected to increase subsequently. Consistent with this model, rOFA was activated later than rpFFA, and rpFFA exerted extensive directional influence onto rOFA when processing Mooney faces, suggesting a cortical analysis dominated by rpFFA to rOFA projection. However, when explicit internal facial features were available but misarranged within a normal face contour, a temporal pattern similar to that of normal faces was observed. Finally, we further investigated the temporal dynamics when face-specific responses were driven by contextual cues alone with the internal face features entirely missing (Cox et al., 2004). 
In this case, rOFA, rpFFA and raFFA were activated somewhat late and almost simultaneously, corresponding to contextual modulation that facilitated processing across the core face-processing network in parallel.

Results

Face induced MEG signals in the source space

Subjects were presented with famous faces and familiar objects and instructed to perform a simple classification task (face or object) while their brain activity was recorded using MEG. After a rest period, each subject was scanned with fMRI viewing the same group of face and object images presented in separate blocks. Since each subject underwent both fMRI and MEG measurements, we could compare the face-selective regions defined by fMRI with the reconstructed MEG signals evoked by faces in the source space.

Subjects’ face-selective regions in the occipitotemporal cortex were localized with fMRI contrasting responses to faces with that to objects. MEG signals at different time points were reconstructed in the source space by computing LCMV beamformer solution on evoked data after preprocessing (Van Veen et al., 1997). The estimated activities for the whole cortical surface can be viewed as a 3D spatial distribution of LCMV value (power normalized with noise) at each time point (Sekihara and Nagarajan, 2008).
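As a rough illustration of the noise-normalized beamformer output described above, the following NumPy sketch computes a unit-gain LCMV spatial filter for a single source with fixed orientation and returns output power divided by projected noise power. The function name, the regularization value and the simulated leadfields are assumptions made for illustration only; this is not the analysis pipeline used in the study.

```python
import numpy as np

def lcmv_power_ratio(leadfield, data_cov, noise_cov, reg=0.05):
    """Noise-normalized LCMV output power for one fixed-orientation source.

    leadfield : (n_sensors,) forward field of the source
    data_cov  : (n_sensors, n_sensors) covariance of the evoked data
    noise_cov : (n_sensors, n_sensors) noise covariance
    reg       : regularization as a fraction of mean sensor variance (assumed value)
    """
    n = data_cov.shape[0]
    # Regularize the data covariance before solving the linear system.
    c_reg = data_cov + reg * np.trace(data_cov) / n * np.eye(n)
    c_inv_l = np.linalg.solve(c_reg, leadfield)
    # Unit-gain constraint: w @ leadfield == 1 while output variance is minimized.
    w = c_inv_l / (leadfield @ c_inv_l)
    source_power = w @ data_cov @ w
    noise_power = w @ noise_cov @ w
    return w, source_power / noise_power

# Toy simulation (hypothetical numbers): one active source plus sensor noise.
rng = np.random.default_rng(0)
n_sensors, n_times = 8, 2000
l_true = rng.standard_normal(n_sensors)    # leadfield of the active source
l_other = rng.standard_normal(n_sensors)   # leadfield of a silent location
noise = rng.standard_normal((n_sensors, n_times))
signal = 3.0 * np.sin(np.linspace(0, 40 * np.pi, n_times))
data = np.outer(l_true, signal) + noise

w_true, ratio_true = lcmv_power_ratio(l_true, np.cov(data), np.cov(noise))
_, ratio_other = lcmv_power_ratio(l_other, np.cov(data), np.cov(noise))
```

In this toy case the ratio at the active location is much larger than at the silent one, which is the property that lets such maps be read as a spatial distribution of activity at each time point.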

Figure 1 shows the fMRI-identified face regions and MEG-measured face-evoked signals in a typical subject, displayed in ventral and lateral views of an inflated right hemisphere (source localization results and fMRI localization results for more individual subjects are shown in Figure 1—figure supplements 1–4). Face-selective regions rOFA, rpFFA, raFFA and rpSTS were identified by the fMRI localizer (Figure 1A). MEG responses evoked by faces are shown in 10 ms steps from 120 ms to 160 ms in source space (cortical surface) (Figure 1B). In the MEG signal, the location of a cluster of activation in the right occipital cortex at about 120 ms after stimulus onset is consistent with rOFA. At about 150–160 ms, a cluster of activation was found in the posterior part of the superior temporal sulcus, overlapping with rpSTS. Two temporally separated clusters of MEG source activation were found in the right fusiform gyrus, one consistent with the location of pFFA (about 130 ms) and another with aFFA (about 150 ms) (see Video 1). Similar spatiotemporal patterns of activation could be seen across the 13 subjects tested. These results show that face-responsive areas identified by MEG are highly consistent with those defined by fMRI; it is therefore reasonable to extract the MEG time courses based on fMRI-guided regions of interest (ROIs). In this paper, with the understanding that the sources of MEG signals were constrained by the fMRI-defined ROIs, we use the fMRI terms (OFA, FFA and pSTS) to indicate the corresponding cortical areas in the MEG data.

Figure 1 (with 4 supplements)
Face-selective areas identified by fMRI localizer and face-evoked MEG source activation displayed on an inflated right hemisphere of a typical subject.

(A) Face-selective statistical map (faces>objects) showing four face-selective regions (rOFA, rpFFA, raFFA and rpSTS). (B) Face-evoked MEG source activation patterns represented as LCMV value maps at different time points (120-160 ms) after the stimulus onset. LCMV values represent signal power normalized by noise.

Video 1
MEG activation of a typical subject.

Bottom-up processing sequence induced by normal faces

We investigated the typical dynamic sequence for processing faces in the ventral occipitotemporal cortex by presenting subjects with face images of well-known individuals. We analyzed the time courses of face-selective areas identified in the source space. Seven face-selective areas (lOFA, rOFA, lpFFA, rpFFA, raFFA, lpSTS, rpSTS) were identified, guided by fMRI face localizer results from each individual subject, and they were used to extract the face-response time courses of the MEG source data. We averaged the resulting time courses across subjects and the waveforms are shown in Figure 2A. Face images induced stronger responses than objects in face-selective areas, especially in the right hemisphere. The timing of peak responses for individual ROIs is summarized in Figure 2B and C, revealing the fundamental temporal characteristics of the neural processing of faces. In the right hemisphere, face-evoked responses emerged earlier in the posterior areas than in the anterior areas; the peak responses occurred at 116 ± 6 ms, 125 ± 5 ms and 150 ± 10 ms for rOFA, rpFFA and raFFA, respectively. Although there is no significant difference between rOFA and rpFFA (t12 = 1.57, p=0.43, Bonferroni corrected), the peak response timing of raFFA is significantly delayed compared with rpFFA (t11 = 3.21, p=0.025, Bonferroni corrected), suggesting a bottom-up process. Similarly, OFA reached its peak response earlier than pFFA in the left hemisphere (lOFA:122 ± 5 ms, lpFFA:126 ± 6 ms), although this trend is not statistically significant (t11 = 0.64, p>0.05, Bonferroni corrected). Responses from the left anterior FFA are not shown because the corresponding activation cluster was not observed clearly in most subjects. In addition, the dorsal face-selective region pSTS showed weaker and temporally broader responses, being involved in face processing roughly from 130 to 180 ms.
The sequential progression from posterior to anterior regions along the ventral occipitotemporal cortex, especially the significantly delayed activation of raFFA, indicates a bottom-up hierarchical functional structure of the ventral face pathway.
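The peak-latency comparison reported above can be sketched in a few lines. The example below is a minimal illustration, assuming that a peak is taken as the maximum of an ROI time course inside a search window and that ROIs are compared with a paired t statistic followed by Bonferroni correction; the window limits, function names and toy numbers are hypothetical and are not taken from the study's Materials and methods.

```python
import numpy as np

def peak_latency_ms(time_course, times_ms, window=(50.0, 250.0)):
    """Latency (ms) of the maximum response inside a search window (assumed window)."""
    times_ms = np.asarray(times_ms, dtype=float)
    time_course = np.asarray(time_course, dtype=float)
    mask = (times_ms >= window[0]) & (times_ms <= window[1])
    idx = np.argmax(time_course[mask])
    return float(times_ms[mask][idx])

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for the difference b - a."""
    d = np.asarray(b, dtype=float) - np.asarray(a, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return float(t), n - 1

def bonferroni(p, n_comparisons):
    """Bonferroni-corrected p value, capped at 1."""
    return min(1.0, p * n_comparisons)

# Toy time course: a Gaussian bump centered at 125 ms, sampled every 1 ms.
times = np.arange(0.0, 300.0, 1.0)
bump = np.exp(-0.5 * ((times - 125.0) / 15.0) ** 2)
latency = peak_latency_ms(bump, times)

# Toy paired comparison of per-subject latencies in two ROIs.
t_value, dof = paired_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Restricting the search to a window avoids picking up late or baseline fluctuations as the peak; the appropriate window depends on the data and is an analysis choice.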

Figure 2 (with 1 supplement)
Temporal response characteristics of face-selective ROIs.

(A) The time courses of face (solid line) and object (dotted line) induced responses averaged across subjects, for the seven face-selective ROIs. Shaded areas denote SEM. The green bar indicates a significant difference between face and object. Significance was assessed by a cluster-based permutation test (cluster-defining threshold p<0.05, significance level p<0.05) for each ROI. (B) The peak latency averaged across subjects for each ROI (mean ± SEM). The peak latency of raFFA is significantly later than that of rpFFA (t11 = 3.21, p=0.025, Bonferroni corrected). (C) The mean peak latencies for the face-selective ROIs are shown on inflated cortical surfaces of both hemispheres at corresponding locations.

In addition to famous faces, we also presented unfamiliar faces to subjects and analyzed the data in the same way. Results showed essentially similar hierarchical dynamic sequences of face processing regardless of face familiarity (Figure 2—figure supplement 1). Thus, unfamiliar face images were used in the next experiment reported below.

Top-down operation in face processing highlighted by viewing Mooney faces

While the processing of normal (famous or unfamiliar) faces mainly followed the posterior to anterior (bottom-up) face processing sequence, we further investigated the possibility that under certain stimulus conditions, top-down modulation of face processing could become more prominent. According to the predictive coding theory, when the representation of sensory input in lower areas is poorly matched with the predictions generated from higher level areas, the activity in lower areas representing residual error would be increased (Rao and Ballard, 1999; Murray et al., 2004; Mumford, 1992). Hence, we adopted the two-tone Mooney face images (Figure 3A), which could be recognized as faces but lack prototypical local facial features, as the main stimuli in this experiment. Our hypothesis was that when processing Mooney faces which could activate the FFA based on the global configuration, the top-down modulation from FFA to OFA (prediction of facial parts) would be more prominent.

Temporal response characteristics and Granger causality analysis for face-selective ROIs during perception of Mooney and normal faces.

(A) Normal and Mooney face images. (B) The peak latency averaged across subjects for each face-selective ROI (mean ± SEM). Mooney faces elicited a response with significantly longer latency in rOFA than normal faces (paired t test, t23 = 4.009, p=0.001). (C) Time courses averaged across subjects for bilateral OFA and pFFA. Gray line is OFA and red line is pFFA. Shaded areas denote SEM. The circles above time courses represent peak latencies of individual subjects. rOFA was engaged significantly later than rpFFA when processing Mooney faces (Paired permutation test p=0.02. Bonferroni corrected). (D) Granger causality analysis performed within a series of 50 ms time windows. Arrows represent statistically significant causal effects (p<0.05, FDR corrected, F test. See Materials and methods for details).

In this experiment, subjects (n = 28) were presented with normal unfamiliar faces and Mooney faces and performed a one-back task, indicating repetitions of the same image. A verbal survey after the MEG experiment indicated that subjects could perceive at least 90% of the Mooney images as faces. In all face-selective areas except rOFA, similar peak latencies were observed during the perception of normal and Mooney faces. Strikingly, Mooney faces elicited a response with significantly longer latency in rOFA than normal faces (paired t test, t23 = 4.009, p=0.001) (Figure 3B). The temporal relationship of signals in the face-selective areas was quite different during the perception of Mooney faces compared with that of normal ones (Figure 3C). Similar to Experiment 1, the OFA was activated slightly earlier than the pFFA in response to normal faces (lOFA:124 ± 9 ms, lpFFA: 133 ± 9 ms, Paired permutation test p>0.9; rOFA: 107 ± 4 ms, rpFFA:120 ± 6 ms, Paired permutation test p=0.37. Bonferroni corrected for multiple comparisons). However, when processing Mooney faces, rOFA was engaged significantly later than rpFFA (rOFA:144 ± 8 ms, rpFFA: 117 ± 8 ms. Paired permutation test p=0.02. Bonferroni corrected). The response curve of rOFA was temporally shifted to a later point while the temporal characteristics of rpFFA were not much different from its response to normal faces (Figure 3C). The temporal relationship between OFA and pFFA in the left hemisphere is similar to that in the normal face condition (lOFA:127 ± 10 ms, lpFFA: 133 ± 10 ms. Paired permutation test p>0.9, Bonferroni corrected).
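For readers unfamiliar with paired permutation tests, the sketch below shows one common variant: the sign of each subject's latency difference is flipped at random to build a null distribution for the mean difference. The function name, the number of permutations and the toy latencies are assumptions for illustration; the exact procedure used in the study is described in its Materials and methods and may differ.

```python
import numpy as np

def paired_permutation_p(a, b, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation p value for the mean of b - a."""
    rng = np.random.default_rng(seed)
    d = np.asarray(b, dtype=float) - np.asarray(a, dtype=float)
    observed = abs(d.mean())
    # Under the null hypothesis, the sign of each paired difference is exchangeable.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((signs * d).mean(axis=1))
    # Add-one correction so that the estimated p value is never exactly zero.
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

# Hypothetical per-subject peak latencies (ms) in two ROIs, 12 subjects.
roi_early = np.array([110, 120, 115, 125, 118, 122, 112, 119, 121, 117, 116, 123.0])
roi_late = roi_early + np.array([25, 30, 20, 28, 35, 22, 27, 31, 24, 29, 26, 33.0])

p_shifted = paired_permutation_p(roi_early, roi_late)
p_identical = paired_permutation_p(roi_early, roi_early)
```

When several ROI pairs are tested, the resulting p values would then be multiplied by the number of comparisons (Bonferroni correction) and capped at 1.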

To further analyze the dynamic causal relationship between OFA and pFFA, we performed Granger causality analysis over sliding time windows of 50 ms duration from 75 to 230 ms after stimulus presentation, a range that covers the periods of essential activation in OFA and pFFA. The significant directed connectivity in each time window is shown in Figure 3D. Directed influences from pFFA to OFA were much more extensive during the processing of Mooney faces than of normal faces. In particular, rpFFA influenced rOFA continuously from 75 to 170 ms in the Mooney face condition, whereas this influence was observed more sparsely in the normal face condition. Thus, response time courses and Granger causality analysis together show that, compared with the processing of normal faces, the cortical processing of Mooney faces is more dominated by the top-down rpFFA to rOFA projection.
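The idea behind the directed-influence analysis can be illustrated with a minimal bivariate Granger sketch: the past of a source signal "Granger-causes" a target if adding it to an autoregressive model of the target reduces the residual variance, which is summarized by an F statistic. Everything below is an assumption for illustration (model order, window step, simulated signals, function names); it computes only the F statistic and leaves out the p value and FDR correction reported in the study.

```python
import numpy as np

def lagged(x, order):
    """Matrix whose k-th column is x delayed by k samples (k = 1..order)."""
    n = len(x)
    return np.column_stack([x[order - k: n - k] for k in range(1, order + 1)])

def granger_f(source, target, order=2):
    """F statistic for 'source Granger-causes target' with an assumed model order."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    y = target[order:]
    ones = np.ones((len(y), 1))
    restricted = np.hstack([ones, lagged(target, order)])   # target's own past only
    full = np.hstack([restricted, lagged(source, order)])   # plus the source's past
    rss_r = np.sum((y - restricted @ np.linalg.lstsq(restricted, y, rcond=None)[0]) ** 2)
    rss_f = np.sum((y - full @ np.linalg.lstsq(full, y, rcond=None)[0]) ** 2)
    df_den = len(y) - full.shape[1]
    f_stat = ((rss_r - rss_f) / order) / (rss_f / df_den)
    return float(f_stat), (order, df_den)

# 50 ms windows between 75 and 230 ms; the 5 ms step is an assumed value.
windows = [(start, start + 50) for start in range(75, 230 - 50 + 1, 5)]

# Toy signals in which x drives y with a one-sample delay.
rng = np.random.default_rng(1)
x = rng.standard_normal(500)
y = np.zeros(500)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.standard_normal(499)

f_x_to_y, dof = granger_f(x, y)
f_y_to_x, _ = granger_f(y, x)
```

In practice each 50 ms window contains few samples, so such models are usually fitted on data pooled across trials; the single long series used here is only meant to make the direction of influence easy to see.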

Primarily feedforward processing of face-like stimuli with misarranged internal features

We also investigated the processing dynamics of face-like stimuli with internal features clearly available but spatially misarranged, to contrast with the processing of normal as well as Mooney faces. The normal external features (hair, chin, face outline) and the locally normal internal features led to the engagement of the face-sensitive areas. Results show that the rOFA, rpFFA and raFFA were activated sequentially (rOFA: 132 ± 7 ms, rpFFA: 133 ± 5 ms, raFFA: 169 ± 12 ms. Figure 4B). Compared with the responses to normal faces, the activations in the rOFA and rpFFA were somewhat delayed in the case of the distorted faces. However, unlike the Mooney faces, the distorted faces still engaged the OFA earlier than the FFA, presumably because of the explicitly available local facial features. While the dominant signals are consistent with a feedforward processing from OFA to FFA, there was a hint of a predictive error signal, possibly related to the misarranged spatial configurations, that produced a low activity in rOFA at a later stage.

Temporal response characteristics for face-selective ROIs in response to distorted face.

(A) Example stimuli and averaged time courses for each face-selective ROI. The green horizontal bar indicates a significant difference between distorted face and object (cluster-defining threshold p<0.01, corrected significance level p<0.05). (B) Peak latency averaged across subjects for each ROI. The peak latency of raFFA is significantly later than that of rpFFA (paired t test, p=0.019, t8 = 2.92).

Parallel facilitation of face-processing network from contextual cues alone

In real life, facial features are not always available. Previous studies showed that face-specific responses could be elicited by contextual body cues (Cox et al., 2004; Chen and Whitney, 2019; Martinez, 2019). Here we further investigated the dynamics of contextual facilitation of face processing when face perception was supported by contextual cues alone without explicit facial features using the same experimental paradigm and data analysis procedures as before.

Three types of stimuli were presented to subjects: (i) images of highly degraded faces (no internal facial features) with contextual body cues that imply the presence of faces, (ii) images similar to those in (i) but with body cues arranged in an incorrect configuration, which therefore do not imply the presence of faces, and (iii) images of objects (Figure 5A). Activation in rOFA, rpFFA and raFFA was significantly higher for the condition in which faces were clearly implied by the contextual cues than for the condition in which objects were presented (Figure 5B). However, when contextual cues were misarranged so that faces were not strongly implied, only the rOFA showed stronger activation than objects, at a late stage (Figure 5B). Furthermore, peak latency analysis revealed that during the perception of ‘faces’ generated from contextual cues alone, the rOFA, rpFFA and raFFA were all engaged at about the same, relatively late time (rOFA: 149 ± 12 ms, rpFFA: 149 ± 14 ms, raFFA: 155 ± 11 ms) rather than being activated sequentially (Figure 5C). Thus when the presence of a face was facilitated by external cues alone, the evoked responses in the core face-processing network emerged slowly and almost simultaneously.

Temporal response characteristics for face-selective ROIs in response to contextual cues.

(A) Example stimuli. (B) Time courses averaged across subjects for each condition. For each ROI, blue horizontal bars indicate a significant difference between degraded faces with relevant body cues and objects, and red horizontal bars indicate a significant difference between degraded faces with irrelevant body cues and objects (cluster-defining threshold p < 0.05, corrected significance level p < 0.05). (C) The peak latency averaged across subjects for each face-selective ROI (mean ± SEM).

Discussion

Using a combined fMRI and MEG source localization approach, our results revealed a detailed dynamic picture of face information processing. Within the ventral occipitotemporal face processing network, normal faces were processed mainly in a bottom-up manner through the hierarchical pathway, where input information was processed sequentially from posterior to anterior ventral temporal cortex. This temporal order was also observed when processing face-like stimuli with misarranged internal facial features. In contrast, during the processing of Mooney faces in the absence of prototypical facial features, top-down modulation was more prominent, with the dominant information flow running from the rpFFA to the rOFA. Moreover, face-specific responses from contextual cues alone were evoked late and simultaneously across the rOFA, rpFFA and raFFA, suggesting that contextual facilitation acted in parallel on the core face-processing network. These results advance our understanding of the hierarchical and non-hierarchical models of face perception, especially underscoring the stimulus- and context-dependent nature of the processing sequences.

During the perception of 2-tone Mooney faces, it is necessary to discount shadows and recover 3D surface structure from 2D images (Grützner et al., 2010). Interestingly, only familiar objects, like faces, can easily be interpreted as volumetric from 2-tone representations (Moore and Cavanagh, 1998; Hegde et al., 2007). Thus it is supposed that prior knowledge plays an important role in the recovery of 3D shape from Mooney images (Braje et al., 1998; Gerardin et al., 2010). A top-down model emphasized the guidance of prior experience at higher levels (Cavanagh, 1991). This model is supported by evidence from experiments showing that early visual processing is affected by high-level attributes in both humans and monkeys (Lee et al., 2002; Humphrey et al., 1997; Issa et al., 2018). As briefly mentioned in the Results section, the dynamics of MEG signals associated with processing Mooney faces, which highlight the top-down modulation, are consistent with an explanation based on the predictive coding model. This model proposes that hypotheses or predictions made at higher cortical areas are compared, through feedback, with representations at lower areas to generate a residual error, which is then forwarded to higher stages as ‘neural activity’ (Rao and Ballard, 1999; Murray et al., 2004; Friston, 2005; Friston, 2010). Specifically, the face model/prediction is generated at the rpFFA based on the global configuration of Mooney faces using prior knowledge about 3D faces, illumination, and cast shadows. This prediction of expected facial features is then poorly matched with the input representation at the rOFA, which lacks the explicit prototypical facial features due to the mixed illumination-invariant and illumination-dependent features, generating an increased signal at rOFA. Thus, the dominant signal at the rOFA (residual) necessarily lags behind the signal at the rpFFA (hypothesis).
However, when processing normal faces or faces with misarranged facial features, the prominent signal in the early stage of rOFA is mainly due to the strong feedforward input from early visual cortex, as rOFA is robustly responsive to clear facial components. The prediction feedback from rpFFA would be consistent with the representation at the rOFA in the case of normal faces, resulting in little error signal; with misarranged facial features, there was a hint of a late increase in the rOFA signal, possibly indicating that the feedback signal could contain some spatial information as well.

The timing of face-induced neural activation has been studied for a long time with various techniques, such as the combination of MEG and fMRI using representational similarities (Cichy et al., 2014; Cichy et al., 2016), MEG source localization and intracranial EEG (Kadipasaoglu et al., 2017; Keller et al., 2017; Ghuman et al., 2014). An early MEG study suggested that two stages (early categorization and late identification) are involved in face processing (Liu et al., 2002). Combined with the fMRI observation that OFA is responsible for identifying facial parts while FFA is responsible for holistic configuration (Rotshtein et al., 2005; Liu et al., 2010; Pitcher et al., 2011b; Arcurio et al., 2012; Pitcher et al., 2007; Schiltz, 2010), OFA is expected to respond earlier than FFA. A simultaneous electroencephalogram (EEG)-fMRI study also showed that OFA responded to faces earlier than FFA (OFA: 110 ms; FFA: 170 ms) (Sadeh et al., 2010). Using transient stimulation to temporally disrupt local neural processing, Transcranial Magnetic Stimulation (TMS) experiments suggested that OFA processes facial information at about 100/110 ms, while pSTS begins processing faces at about 100/140 ms (Pitcher et al., 2012; Pitcher et al., 2014). However, the sources of the N/M170 face-selective component remain controversial: some studies suggest that it arises from the fusiform gyrus (Deffke et al., 2007; Kume et al., 2016; Perry and Singh, 2014), while others emphasize the contribution of the inferior occipital gyrus in addition to the fusiform gyrus (Itier et al., 2006; Gao et al., 2013) or even of the pSTS (Nguyen and Cunnington, 2014). Our results provide more precise and detailed timing information about the core face network under various stimulus and contextual conditions, especially the temporal relationship between rpFFA and raFFA. raFFA is engaged significantly later, about 20 ms after the rpFFA, suggesting that the raFFA likely plays a different functional role from rpFFA.
This idea is supported by previous anatomical evidence showing that pFFA and aFFA have different cellular architectures (Weiner et al., 2017).

Our results also shed light on the role of internal and external features in face perception. Although facial features are processed holistically when assembled into a whole face, and the representation of internal features is influenced by external features (Andrews et al., 2010), eyes in isolation elicit a later but larger N170 (Bentin et al., 1996; Rossion and Jacques, 2011) and can drive face-selective neurons as well as full-face images do (Issa and DiCarlo, 2012) in monkeys. In our results, the somewhat slower but still sequential progression of face responses elicited by face-like stimuli with clear but misarranged internal features within a face outline further supports the view that facial features are sufficient to trigger the bottom-up face processing sequence. In addition, certain stimulus manipulations, such as face inversion (Bentin et al., 1996), contrast reversal, Mooney transformation or removal of facial features, produced comparable (or even increased amplitude) but delayed N170 responses (Rossion and Jacques, 2011). Thus it is suggested that as long as the impoverished stimulus is perceived as a face, inferior temporal cortex areas will be activated (McKeeff and Tong, 2007; Grützner et al., 2010). Our results add further detail to this explanation by showing the top-down rpFFA to rOFA projection when the prototypical facial features are lacking.

Besides facial features, contextual information is also important for face interpretation (Chen and Whitney, 2019; Martinez, 2019). Interestingly, FFA can be activated by the perceived presence of faces from contextual body cues alone (Cox et al., 2004). Here our MEG data showed that the face-selective areas in the ventral core face network were indeed activated by the contextual cues for faces, but they were not activated in any particular order; instead, they became active together at a late stage. This is similar to the temporal dynamics observed in visual imagery, a top-down process given the absence of visual inputs (Dijkstra et al., 2018). Future studies are needed to elucidate how the core face network interacts with other brain regions to trigger face perception. For example, according to an MEG study using a fast periodic visual stimulation approach (Rossion et al., 2012; Rossion et al., 2015; de Heering and Rossion, 2015), top-down attention increases the response in FFA through gamma synchrony between the inferior frontal junction and FFA (Baldauf and Desimone, 2014).

Face perception is shaped by long-term visual experience; for example, familiar faces are processed more efficiently than unfamiliar ones (Landi and Freiwald, 2017; Schwartz and Yovel, 2016; Dobs et al., 2019; Gobbini and Haxby, 2006). In terms of the dynamics in the ventral occipitotemporal areas, the present results showed little difference between processing famous and unfamiliar faces. This could be due to several reasons. First, many studies suggested that regions in the anterior temporal lobe rather than OFA and FFA represent face familiarity (Gobbini and Haxby, 2007; Pourtois et al., 2005; Sugiura et al., 2011). However, the extended face system is beyond the scope of our current study because some areas in the extended system are too deep to obtain a good MEG source signal. Second, some subjects might not be familiar with all the famous faces we used. Third, familiarity may affect face recognition via high gamma frequency band activity (Anaki et al., 2007), which is not included in our data analysis.

Bilateral pSTS showed weak and multi-peaked responses during both famous and unfamiliar face processing despite the task differences. One possible reason for the multiple response peaks is that, as a hub for integrating information from multiple sources (e.g., face, body, and voice), STS contains regions that respond to different types of information (Grossman et al., 2005; Bernstein and Yovel, 2015). Many studies have suggested diverse functional roles of pSTS in representing changeable aspects of faces, such as expression, lip movement and eye-gaze (Baseler et al., 2014; Engell and Haxby, 2007). Specifically, pSTS is involved in the analysis of facial muscle articulations which are combined to produce facial expressions (Srinivasan et al., 2016; Martinez, 2017). In addition, pSTS may respond to dynamic motion information conveyed through faces (O'Toole et al., 2002).

Previous studies showed that the left and right fusiform gyri are differentially involved in face/non-face judgements (Meng et al., 2012; Goold and Meng, 2017), ‘low-level’ face semblance and perceptual learning of faces (Bi et al., 2014; Feng et al., 2011; McGugin et al., 2018). Interestingly, in our results, the peak latency of the left pFFA was later than that of the right pFFA in all conditions except the famous face condition. Responses evoked by distorted faces with misarranged features showed the largest lateral difference (20 ms). One possible reason is that the signal attributed to the left pFFA is in fact a mixture of signals from pFFA and aFFA.

Although the exact correspondence between human and macaque face-selective areas is still unclear (Tsao et al., 2003; Tsao et al., 2006; Tsao et al., 2008), the dynamic picture of normal face processing revealed in our study is generally similar to that in macaques. Single-unit recording studies showed that activity begins slightly earlier in posterior face patches than in anterior ones, reaching peak levels around 126, 133, and 145 ms for the middle lateral (ML)/middle fundus (MF), anterior lateral (AL), and anterior medial (AM) patches, respectively (Freiwald and Tsao, 2010). Interestingly, there is a discrepancy between two monkeys in the response of the high-level face patch AM to Mooney faces. One of them showed nearly the same peak latency as for normal faces but with more sustained activation, while the other did not respond to Mooney faces (Moeller et al., 2017). This may imply that the processing of Mooney faces is related to individual face detection ability or life experience, and that face processing is not a simple feedforward process from low-level to high-level areas. Consistent with that, a more recent study in monkeys showed a rapid and more sustained response in a high-level face area (aIT) and early-rising then quickly decreasing activity in low-level areas, a signature of the predictive coding model (Issa et al., 2018).

Our study is obviously limited in scope. There are many types of cues and tasks relevant for face perception that could be investigated. In addition to facial features and context, many low-level cues contribute to face recognition, such as illumination direction, pigmentation (surface appearance) and contrast polarity (one region brighter than another) (Russell et al., 2007; Sinha et al., 2006). In particular, neurons tuned for contrast polarity were found in macaque inferotemporal cortex, supporting the notion that low-level image properties are encoded in face regions (Ohayon et al., 2012; Weibert et al., 2018). We purposely avoided the complication of color cues in this study by using gray-scale images, but we are aware of the importance of color in face perception (Yip and Sinha, 2002; Benitez-Quiroz et al., 2018). Moreover, the temporal dynamics of face processing could very well be influenced by different tasks. In our results, there is little difference between the temporal patterns in response to unfamiliar faces under the face category task (Figure 2—figure supplement 1) and the image identity one-back task (Figure 3). Future studies are needed to more comprehensively investigate the role of behavioral tasks, especially during the relatively late stages of face processing.

In summary, our study delineated the precise timing of bottom-up, top-down, as well as context-facilitated processing sequences in the occipital-temporal face network. These results provide a way to understand and reconcile previous discrepant findings, revealing the dominant bottom-up processing when explicit facial features were present, and highlighting the importance of the top-down feedback operations when faced with impoverished inputs with unclear or ambiguous facial features.

Materials and methods

Participants

All subjects (age range 19–31) provided written informed consent and consent to publish before the experiments, and experimental protocols were approved by the Institutional Review Board of the Institute of Biophysics, Chinese Academy of Sciences (#2017-IRB-004). The image used in Figure 3 is a photograph of one of the authors, and the Consent to Publish form was obtained.

Experiment 1 (normal famous and unfamiliar face)

Fifteen subjects were presented with famous faces (popular film actors, 50% female) and objects (houses, scenery and small manmade objects) and were instructed to perform a category classification task (face or object) while their brain activity was recorded using MEG. Two subjects with excessive head motion (>5 mm) were excluded from further analysis. Each image type included 50 exemplars, and all faces were own-race faces. All images were equated for contrast and mean luminance using the SHINE toolbox (Willenbockel et al., 2010). Each trial began with a fixation period of jittered duration (800–1000 ms), then a grayscale visual image (face or object, 8 × 6 °) was presented at the center of the screen for 500 ms, followed by a response period. Subjects were asked to maintain fixation and to report as quickly as possible, with a button press, whether the image was a face or an object. There were 120 trials for each condition. Nine of the thirteen subjects participated in an additional experiment in which unfamiliar faces were used.

Experiment 2 (normal unfamiliar face and Mooney face)

Experiment 2 was conducted similarly to Experiment 1, except that unfamiliar faces and two-tone Mooney faces were presented to subjects (n = 28) in separate blocks (15 trials each) during which subjects performed a one-back task. Two subjects with excessive head motion (>5 mm) were excluded from further analysis.

Experiment 3 (face-like images with spatially misarranged internal features)

Experiment 3 was conducted similarly to Experiment 1, except that distorted face and object images were presented to subjects (n = 9). Distorted face images were created by rearranging the eyes, mouth and nose into a nonface configuration (Liu et al., 2002).

Experiment 4 (contextual cues defined the presence of faces without internal features)

Experiment 4 was conducted similarly to Experiment 2. Three types of stimuli (Figure 5A) were created as described in a previous study (Cox et al., 2004): (i) images of highly degraded faces (no internal facial features) with contextual body cues that imply the presence of faces, (ii) images similar to those in (i) but with body cues arranged in an incorrect configuration that does not imply the presence of faces, and (iii) images of objects. Fifteen subjects participated in this experiment and one of them was excluded from further data analysis due to excessive head motion (>5 mm).

MEG data acquisition and analysis

MEG data were recorded continuously using a 275-channel CTF system. Three coils were attached to the head, one close to the nasion and the other two close to the left and right preauricular points, respectively. fMRI scanning was performed shortly after MEG data collection, and the locations of the coils were marked with vitamin E caplets to align with MEG frames. MEG data analysis was performed using MATLAB (RRID: SCR_001622) and the FieldTrip toolbox (Oostenveld et al., 2011) (RRID: SCR_004849) for artifact detection and MNE-Python (RRID: SCR_005972) for source analysis (Gramfort et al., 2013; Gramfort et al., 2014).

Preprocessing

After acquisition, we first conducted a time correction because there was a time delay (measured with a photodiode) between the stimulus onset on the screen and the trigger signal in the recorded MEG data. Then the data were bandpass filtered with a frequency range of 2–80 Hz and epoched from 250 ms before to 550 ms after the stimulus onset. Bad channels and trials contaminated by artifacts including eye blinks, muscle activities and SQUID jumps were removed before further analysis.
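The delay correction and epoching steps can be illustrated with a minimal Python sketch. It is not the authors' pipeline (which used MATLAB, FieldTrip and MNE-Python): the function name `epoch_bounds`, the 1200 Hz sampling rate, the 20 ms delay and the trigger positions are all hypothetical values chosen for illustration, and band-pass filtering and artifact rejection are not shown.

```python
def epoch_bounds(trigger_samples, sfreq_hz, delay_ms, tmin_ms=-250.0, tmax_ms=550.0):
    """Return half-open (start, stop) sample ranges, one per trial.

    Each trigger is first shifted by the measured trigger-to-screen delay,
    so that time zero corresponds to the true stimulus onset.
    """
    delay = round(delay_ms * sfreq_hz / 1000.0)   # delay in samples
    pre = round(-tmin_ms * sfreq_hz / 1000.0)     # samples before onset
    post = round(tmax_ms * sfreq_hz / 1000.0)     # samples after onset
    bounds = []
    for trig in trigger_samples:
        onset = trig + delay                      # corrected stimulus onset
        bounds.append((onset - pre, onset + post))
    return bounds


# Hypothetical values: 1200 Hz sampling, 20 ms photodiode-measured delay.
bounds = epoch_bounds([1000, 3000], sfreq_hz=1200.0, delay_ms=20.0)
print(bounds)
```

With these assumed values each epoch spans 960 samples (800 ms), and the 20 ms delay moves every epoch 24 samples later than the raw trigger would suggest.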

Source localization

Source localization can be generally divided into two steps: the forward solution and the inverse solution. The forward solution was calculated from three components: a boundary-element model (BEM), which describes the geometry of the head and the conductivities of the different tissues; the coregistration information between MEG and MRI; and a volume source space, which defines the positions of the source locations (10242 sources per hemisphere, with a source spacing of 3.1 mm). For the inverse solution, we first estimated the noise and data covariance matrices from the −250 to 0 ms epochs and the 100 to 350 ms epochs, respectively. Afterwards, the Linearly Constrained Minimum Variance (LCMV) beamformer was calculated using the covariance matrices and the forward solution (Van Veen et al., 1997). The regularization for the whitened data covariance was 0.01. The source orientation that maximizes output source power was selected.
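As a rough illustration of the inverse step, the unit-gain LCMV spatial filter for a single source with a fixed orientation is w = C⁻¹l / (lᵀC⁻¹l), where C is the data covariance and l is the leadfield vector of that source. The sketch below is a toy standard-library implementation under stated assumptions: the 3-sensor covariance and leadfield are made-up numbers, the diagonal-loading form of the regularization is an assumption, and noise whitening and the power-maximizing orientation search described above are not reproduced.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def lcmv_weights(data_cov, leadfield, reg=0.01):
    """Unit-gain LCMV weights for one source with a fixed orientation."""
    n = len(leadfield)
    # Assumed regularization: add reg * mean sensor power to the diagonal.
    loading = reg * sum(data_cov[i][i] for i in range(n)) / n
    c = [[data_cov[i][j] + (loading if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    c_inv_l = solve(c, leadfield)                          # C^-1 l
    gain = sum(l * v for l, v in zip(leadfield, c_inv_l))  # l^T C^-1 l
    return [v / gain for v in c_inv_l]


# Made-up 3-sensor example.
cov = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]]
leadfield = [1.0, 0.5, -0.2]
weights = lcmv_weights(cov, leadfield)
unit_gain = sum(w * l for w, l in zip(weights, leadfield))
print(weights, unit_gain)
```

The product of the weights with the leadfield equals one, which is the linear constraint that gives the method its name: activity from the target location passes with unit gain while total output variance is minimized.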

Time course analysis

To explore the time course, virtual sensors were computed on the 30 Hz low-pass filtered data using the LCMV beamformer at the grid points within individual face-selective areas. The time course of each face-selective area was extracted from the grid point showing the maximum MEG response. Subjects who did not show the corresponding face-selective areas in the fMRI localizer were excluded from time course extraction (see Table 1 for details). To identify time points of significant differences, we performed non-parametric statistical tests with cluster-based multiple comparison correction (Maris and Oostenveld, 2007).

Table 1
Number of subjects showing fMRI defined face-selective areas.
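| Area | Experiment 1 (famous face) | Experiment 1 (unfamiliar face) | Experiment 2 | Experiment 3 | Experiment 4 |
|---|---|---|---|---|---|
| lOFA | 13/13 | 9/9 | 25/26 | 9/9 | 13/14 |
| lpFFA | 13/13 | 9/9 | 26/26 | 9/9 | 14/14 |
| lpSTS | 13/13 | 9/9 | 18/26 | 9/9 | 11/14 |
| rOFA | 13/13 | 9/9 | 26/26 | 9/9 | 14/14 |
| rpFFA | 13/13 | 9/9 | 26/26 | 9/9 | 14/14 |
| raFFA | 12/13 | 9/9 | 18/26 | 9/9 | 12/14 |
| rpSTS | 13/13 | 9/9 | 23/26 | 9/9 | 14/14 |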

Peak latency analysis

For each ROI of each subject, peak latency was defined as the timing of the largest peak within the first 250 ms of the averaged response. To avoid the influence of bad source data with weak signal, time courses without any time point showing a response 5 SDs above the baseline (time average from −250 to 0 ms) were eliminated from the peak analysis. The numbers of subjects used in the peak latency analysis are summarized in Table 2. Two-tailed paired t tests (subjects with missing values were excluded) were used to compare the peak latencies between ROIs. In Experiment 2, a more rigorous statistical approach, a two-sample paired permutation test (10000 permutations), was used to compare the peak latencies between pFFA and OFA (see Results for details).

Table 2
Number of subjects used in peak latency analysis.
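| Area | Experiment 1 (famous face) | Experiment 1 (unfamiliar face) | Experiment 2 (normal face) | Experiment 2 (Mooney face) | Experiment 3 (distorted face) | Experiment 4 (contextual cues defined face) |
|---|---|---|---|---|---|---|
| lOFA | 13/13 | 9/9 | 24/26 | 24/26 | 9/9 | 13/14 |
| lpFFA | 12/13 | 9/9 | 25/26 | 25/26 | 9/9 | 12/14 |
| lpSTS | 11/13 | 8/9 | - | - | - | |
| rOFA | 13/13 | 9/9 | 24/26 | 26/26 | 9/9 | 13/14 |
| rpFFA | 13/13 | 9/9 | 24/26 | 25/26 | 9/9 | 13/14 |
| raFFA | 12/13 | 8/9 | 18/26 | 15/26 | 9/9 | 10/14 |
| rpSTS | 12/13 | 7/9 | - | - | - | - |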
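The peak latency definition and the paired permutation test described in this section can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the peak is taken as the maximum value between 0 and 250 ms, the 5 SD criterion is read as "baseline mean plus five baseline standard deviations", the permutation test flips the sign of each subject's paired difference, and all time courses and latencies below are synthetic.

```python
import math
import random
import statistics


def peak_latency(times_ms, response, n_sd=5.0, window_ms=(0.0, 250.0)):
    """Latency of the largest response in the window, or None for weak signals."""
    baseline = [r for t, r in zip(times_ms, response) if -250.0 <= t < 0.0]
    threshold = statistics.mean(baseline) + n_sd * statistics.stdev(baseline)
    if max(response) <= threshold:
        return None  # no time point exceeds baseline by n_sd SDs: exclude
    in_window = [(r, t) for t, r in zip(times_ms, response)
                 if window_ms[0] <= t <= window_ms[1]]
    return max(in_window)[1]


def paired_permutation_test(a, b, n_perm=10000, seed=0):
    """Two-tailed sign-flip permutation test on paired differences."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Synthetic time course: small baseline fluctuation, then a peak at 120 ms.
times = list(range(-250, 550, 10))
response = [0.1 if i % 2 else -0.1 for i in range(25)] + [
    10.0 * math.exp(-((t - 120) / 30.0) ** 2) for t in range(0, 550, 10)]
flat = [0.1 if i % 2 else -0.1 for i in range(len(times))]
latency = peak_latency(times, response)
excluded = peak_latency(times, flat)

# Synthetic per-subject peak latencies (ms) for two ROIs.
ofa = [118, 121, 125, 119, 122, 120, 124, 117, 123, 121]
pffa = [131, 135, 133, 130, 138, 129, 136, 128, 134, 133]
p_value = paired_permutation_test(ofa, pffa)
print(latency, excluded, p_value)
```

Because the null distribution is built from the data themselves, the permutation test does not rely on the normality assumption behind the paired t test, which is the sense in which it is the more rigorous option.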

Granger causality analysis

To study the regional information flow between ROIs, we employed Granger causality analysis (Granger, 1969), a statistical technique based on how well one time series predicts another. Time courses used in this analysis were extracted from each ROI without low-pass filtering. Causality analysis was performed using the Multivariate Granger Causality (MVGC) toolbox (Barnett and Seth, 2014). The evoked response was removed from the data by linear regression before further analysis because Granger causality analysis assumes that the time series is stationary, and this assumption is challenged in evoked brain responses (Wang et al., 2008). We conducted separate analyses over a series of overlapping 50 ms time windows (based on a previous study; Ashrafulla et al., 2013) from 75 to 230 ms, which covers the period of face-induced activation in both OFA and FFA. When choosing the size of the time window, there is a trade-off between stationarity and temporal resolution (shorter is better) on the one hand and accuracy of model fit (longer is better) on the other. Moreover, a smaller window was not considered because activity beyond the beta band was not strong according to the power spectrum. First, the best model order was selected according to the Bayesian information criterion (BIC). Then the corresponding vector autoregressive (VAR) model parameters were estimated for the selected model order and the autocovariance sequence for the VAR model was calculated. Next, the bidirectional Granger causality values for each pair of ROIs were obtained by calculating pairwise-conditional time-domain MVGCs based on the autocovariance sequence. Finally, to evaluate whether causality values were significantly greater than zero (null hypothesis: causality value = 0), we performed significance tests using the F null distribution with FDR correction for multiple comparisons (Benjamini and Hochberg, 1995).
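The core idea of time-domain Granger causality can be shown with a deliberately simplified sketch: the causality from y to x is the log ratio of the residual variance of a model that predicts x from its own past to that of a model that also uses the past of y. The example below is not the MVGC toolbox procedure; it fixes the model order at 1 instead of selecting it by BIC, handles only two signals, omits evoked-response removal, windowing and significance testing, and uses simulated data in which y drives x by construction.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def granger_order1(x, y):
    """Granger causality y -> x with lag-1 autoregressive models (no intercept)."""
    xt, x1, y1 = x[1:], x[:-1], y[:-1]
    n = len(xt)
    # Restricted model: x_t = a * x_{t-1} + e
    a = dot(xt, x1) / dot(x1, x1)
    var_r = sum((t - a * p) ** 2 for t, p in zip(xt, x1)) / n
    # Full model: x_t = a * x_{t-1} + b * y_{t-1} + e (2x2 normal equations)
    sxx, syy, sxy = dot(x1, x1), dot(y1, y1), dot(x1, y1)
    sx, sy = dot(xt, x1), dot(xt, y1)
    det = sxx * syy - sxy * sxy
    a_f = (sx * syy - sy * sxy) / det
    b_f = (sy * sxx - sx * sxy) / det
    var_f = sum((t - a_f * p - b_f * q) ** 2 for t, p, q in zip(xt, x1, y1)) / n
    return math.log(var_r / var_f)


# Simulated signals: y influences x with a one-sample lag, not vice versa.
rng = random.Random(0)
x, y = [0.0], [0.0]
for _ in range(2000):
    x_new = 0.5 * x[-1] + 0.8 * y[-1] + rng.gauss(0.0, 1.0)
    y_new = 0.5 * y[-1] + rng.gauss(0.0, 1.0)
    x.append(x_new)
    y.append(y_new)

gc_y_to_x = granger_order1(x, y)
gc_x_to_y = granger_order1(y, x)
print(gc_y_to_x, gc_x_to_y)
```

In this simulation the y-to-x value is clearly positive while the x-to-y value stays near zero, which is the kind of directional asymmetry the analysis above is designed to detect between ROIs.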

fMRI data acquisition and analysis

Scanning was performed on a 3T Siemens Prisma scanner in the Beijing MRI Center for Brain Research. We acquired high-resolution T1-weighted anatomical volumes first, and then performed a run of the functional face localizer (Pitcher et al., 2011a) with interleaved face and object blocks using a gradient echo-planar sequence (20-channel head coil, TR = 2 s, TE = 30 ms, resolution 2.0 × 2.0 × 2.0 mm, 31 slices, matrix = 96 × 96). fMRI data were analyzed using FreeSurfer (RRID: SCR_001847) and AFNI (RRID: SCR_005927). Face-selective areas were defined as regions that responded more strongly to faces than to objects.

References

  14. P Cavanagh (1991) What’s up in top-down processing. In: A Gorea, editor. Representations of Vision. Cambridge University Press. pp. 295–304.
  28. K Friston (2005) A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences 360:815–836. https://doi.org/10.1098/rstb.2005.1622
  67. CM Mooney (1957) Age in the development of closure ability in children. Canadian Journal of Psychology/Revue Canadienne De Psychologie 11:219–226. https://doi.org/10.1037/h0083717
  88. B Rossion, C Jacques (2011) The N170: understanding the time-course of face perception in the human brain. In: SJ Luck, ES Kappenman, editors. The Oxford Handbook of Event-Related Potential Components. Oxford University Press. pp. 115–141. https://doi.org/10.1093/oxfordhb/9780195374148.013.0064

Decision letter

  1. Ming Meng
    Reviewing Editor; South China Normal University, China
  2. Joshua I Gold
    Senior Editor; University of Pennsylvania, United States
  3. Ming Meng
    Reviewer; South China Normal University, China

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Through four experiments, your article combines fMRI and source-localized Magnetoencephalography (MEG) to investigate the dynamics of face information processing in the human brain. I found most interesting your results of the temporal dynamics of the occipital-temporal face network contingent upon bottom-up processing of normal facial inputs versus top-down processing of impoverished facial inputs, which were supported by converging evidence. While there were criticisms by our reviewers on reliability of MEG source localization, new experiments in the revised version of the article provided solid data that greatly strengthened our confidence in the novel technical approach, complementing a large number of previous neuroimaging and neurophysiological studies. Your findings not only fill the knowledge gap of dynamic interactions between the nodes of core face processing network, but also reconcile previous competing models of bottom-up versus top-down face processing mechanisms. Given the importance of face information processing in cognitive psychology, social and affective neurosciences, as well as artificial intelligence, I believe a broad research community including psychologists, neuroscientists and computer scientists would benefit from reading this article. In addition, I think the novel methodological approach that combines fMRI and MEG with clever stimulus design would inspire future studies to follow these steps to further investigate fine-scale temporal dynamics of other important cognitive brain mechanisms.

Decision letter after peer review:

Thank you for sending your article entitled "The bottom-up and top-down processing of faces in the human occipitotemporal cortex" for peer review at eLife. Your article is being evaluated by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation is being overseen by Joshua Gold as the Senior Editor.

Specifically, we think these major issues need to be fully addressed. In the interest of time, eLife normally would only invite a revision if all the major issues could be fully addressed within two months. Should you decide to submit the manuscript elsewhere, I am appending full reviews below that you can use to improve the paper as well:

Major issues:

1) The empirical and conceptual advances made in the current study need to be more clearly articulated with respect to previous work. It has been known for a while that the OFA responds at an earlier latency than the FFA (e.g., Liu et al., 2002), and that certain stimulus manipulations, such as face inversion and contrast reversal, lead to delayed responses to faces (Bentin et al., 1996; Rossion et al., 2000; Rossion et al., 2012). Previous fMRI work has shown that difficult to perceive Mooney faces can lead to response delays on the order of several seconds (McKeeff and Tong, 2007). More recent techniques have allowed research groups to provide more refined estimates of the timing of neural responses, such as the fusion of fMRI-MEG analyzed using representational similarity analysis (e.g., Cichy et al., 2014). Periodic visual stimulation has also been used to characterize the timing of neural responses obtained with EEG/MEG by several research groups (e.g., Rossion et al., 2012, 2014; Norcia et al., 2015), and this approach has been successfully applied to characterize top-down effects of feedback during face processing (e.g., Baldauf and Desimone, 2014).

2) Also, what is lacking significantly is the role of pSTS. We know pSTS is mostly involved in the analysis of facial muscle articulations (also called action units, AUs) and the interpretation of facial expressions and emotion, see Srinivasan et al., 2016, and Martinez, 2017. Also relevant is the role of low-level image features (Weibert et al., 2018), which is also missing from the Discussion; and, the role of color perception (Yip and Sinha, 2002; Benitez-Quiroz et al., 2018).

3) Another point that needs further discussion is the role of internal versus external face features (Sinha et al., 2006), and context (Sinha, Science 2004; Martinez, 2019). These discussions are essential to frame the results of the present paper within existing models of face perception.

4) The conclusions of the study rest on the data from a single experiment, and further investigation of the putative effects of top-down feedback and predictive coding are not provided. A follow-up experiment that both replicates and extends the current findings would help strengthen the study.

5) The reported effects pass statistical significance but not by a large margin. Moreover, there can be concerns that MEG data varies considerably across participants and can lead to heterogeneity of variance, especially across time points. Shuffling of the data with randomized labels would provide a more rigorous approach to statistical analysis.

Reviewer #1:

The neural mechanism of face processing has been a central topic of cognitive neuroscience for many years; however, the dynamics of this mechanism remain unclear. He and colleagues combined fMRI ROI localization and reconstructing source signals from MEG to address this issue. Specifically, the authors analyzed MEG activity dynamics of the face processing core network that had been localized by fMRI. Most notably, when subjects were seeing famous faces, rOFA and rpFFA activity peaked at around 120 ms while raFFA activity peaked at around 150 ms. By contrast, when subjects were seeing Mooney face images, the rOFA activity peaked significantly later than the rpFFA activity. Given that recognizing faces from Mooney images would rely more heavily on top-down mechanisms, the authors argue for a top-down pathway from the rpFFA to rOFA for face processing.

The results are clear-cut and the paper is in general well-written. I believe the present study, if in the end published, would be of interest to a broad readership including psychologists and neuroscientists. I only have a few comments that I wish the authors to address:

1) While recognizing faces from Mooney images would certainly rely heavily on top-down mechanisms, it is hard to rule out the involvement of top-down mechanisms when processing normal face pictures. Intuitively, for example, processing familiar faces would involve more top-down experience driven activity than processing unfamiliar faces. However, the present results seem to suggest no significant differences between processing famous and unfamiliar faces. How come?

2) The Discussion somewhat overlooks effects potentially driven by different tasks. As far as I understand, subjects performed different tasks for the Mooney face experiment and normal face versus object picture experiments.

3) Given studies on the functional role of left FFA (e.g., Meng et al., 2012; Bi et al., 2014; Goold and Meng, 2017), I would be greatly interested in Results and Discussions regarding what the present data could reveal about dynamic relations between the left and right face processing core networks.

4) Some justification would be helpful for using sliding time windows of 50 ms. One possibility is to add power spectrum analysis. In any case, power spectrum analysis might be helpful for revealing further fine-scale temporal dynamics of brain responses.

Reviewer #3:

The authors use MEG to measure cortical responses to normal faces and Mooney face images, and find that in the former case, the putative OFA responds at a somewhat earlier latency than the FFA while in the latter case, the FFA responds at a significantly earlier latency. Granger causality provides additional support for the authors' interpretation that feedback may be occurring from the FFA to the OFA.

The findings are of some interest but there are some major concerns. First, the discussion of previous work is rather limited and does not cite many related studies that have characterized the timing of face processing in the FFA and OFA. It has been known for a while that the OFA responds at an earlier latency than the FFA (e.g., Liu et al., 2002), and that certain stimulus manipulations, such as face inversion and contrast reversal, lead to delayed responses to faces (Bentin et al., 1996; Rossion et al., 2000; Rossion et al., 2012). Previous fMRI work has shown that difficult to perceive Mooney faces can lead to response delays on the order of several seconds (McKeeff and Tong, 2007). More recent techniques have allowed research groups to provide more refined estimates of the timing of neural responses, such as the fusion of fMRI-MEG analyzed using representational similarity analysis (e.g., Cichy et al., 2014). Periodic visual stimulation has also been used to characterize the timing of neural responses obtained with EEG/MEG by several research groups (e.g., Rossion et al., 2012, 2014; Norcia et al., 2015), and this approach has been successfully applied to characterize top-down effects of feedback during face processing (e.g., Baldauf and Desimone, 2014). The empirical and conceptual advances made in the current study need to be more clearly articulated with respect to previous work, and a clear argument for the specific contributions of this study is needed.

Another concern is that the conclusions of the study rest on the data from a single experiment, and further investigation of the putative effects of top-down feedback and predictive coding are not provided. Reproducibility is a serious concern in many fields of science, especially psychology and also neuroscience. A follow-up experiment that both replicates and extends the current findings would help strengthen the study. The reported effects pass statistical significance but not by a large margin. Moreover, there can be concerns that MEG data varies considerably across participants and can lead to heterogeneity of variance, especially across time points. Shuffling of the data with randomized labels would provide a more rigorous approach to statistical analysis.

Reviewer #4:

Authors present an interesting and timely study of the hierarchical functional computations executed during bottom-up and top-down face processing. The results are mostly consistent with what is known and accepted. This is important to support existing models.

A point that is lacking significantly is the role of pSTS. We know pSTS is mostly involved in the analysis of facial muscle articulations (also called action units, AUs) and the interpretation of facial expressions and emotion, see Srinivasan et al., 2016, and Martinez, 2017. Also relevant is the role of low-level image features (Weibert et al., 2018), which is also missing from the Discussion; and, the role of color perception (Yip and Sinha, 2002; Benitez-Quiroz et al., 2018).

Another point that needs further discussion is the role of internal versus external face features (Sinha et al., 2006), and context (Sinha, Science 2004; Martinez, 2019).

These discussions are essential to frame the results of the present paper within existing models of face perception. With appropriate changes, this could be a strong paper.

https://doi.org/10.7554/eLife.48764.sa1

Author response

Major issues:

1) The empirical and conceptual advances made in the current study need to be more clearly articulated with respect to previous work. It has been known for a while that the OFA responds at an earlier latency than the FFA (e.g., Liu et al., 2002), and that certain stimulus manipulations, such as face inversion and contrast reversal, lead to delayed responses to faces (Bentin et al., 1996; Rossion et al., 2000; Rossion et al., 2012). Previous fMRI work has shown that difficult to perceive Mooney faces can lead to response delays on the order of several seconds (McKeeff and Tong, 2007). More recent techniques have allowed research groups to provide more refined estimates of the timing of neural responses, such as the fusion of fMRI-MEG analyzed using representational similarity analysis (e.g., Cichy et al., 2014). Periodic visual stimulation has also been used to characterize the timing of neural responses obtained with EEG/MEG by several research groups (e.g., Rossion et al., 2012, 2014; Norcia et al., 2015), and this approach has been successfully applied to characterize top-down effects of feedback during face processing (e.g., Baldauf and Desimone, 2014).

We appreciate and agree with this suggestion. The dynamics of face-induced neural activation in FFA and OFA have been studied for a long time with various techniques. However, previous results are inconsistent and individually often lack either spatial precision (e.g., sensor-level EEG/MEG analysis) or temporal precision (e.g., fMRI data). Our results with combined fMRI and MEG measures provide detailed and novel timing information about the core face network. For example, the relatively large temporal gap between the right anterior and posterior FFA was not reported in previous studies. Furthermore, our results showed that the temporal relationships between OFA and FFA depend on the internal facial features as well as the context of the visual input, which helps to understand how bottom-up and top-down processing together contribute to face perception.

Many previous studies used the N170/M170 component as the index of face processing in the ventral occipitotemporal cortex. However, the delayed N170/M170 response caused by certain stimulus manipulations (e.g., face inversion, Mooney transformation) represents a relatively crude measure of face processing because of the difficulty in attributing the sources of the delay. On the other hand, fMRI measures alone, showing delayed FFA responses to Mooney faces that were initially not recognized as faces, simply reflect the time it took subjects to recognize difficult Mooney faces, rather than the real-time dynamics of Mooney face processing. In contrast, our results showed that when the face features were confounded with other shadows, the top-down rpFFA to rOFA projection became more dominant.

In the revised manuscript, we discussed the different techniques used to investigate the timing of face responses and the top-down modulation in face processing reported in previous studies (Discussion section paragraph three to five).

2) Also, what is lacking significantly is the role of pSTS. We know pSTS is mostly involved in the analysis of facial muscle articulations (also called action units, AUs) and the interpretation of facial expressions and emotion, see Srinivasan et al., 2016, and Martinez, 2017. Also relevant is the role of low-level image features (Weibert et al., 2018), which is also missing from the Discussion; and, the role of color perception (Yip and Sinha, 2002; Benitez-Quiroz et al., 2018).

The temporal responses of bilateral pSTS are broader (multi-peaked) and showed lower signal-to-noise than the ventral face-selective areas (Figure 2 and Figure 2—figure supplement 1). To increase our confidence about the pSTS time course, we analyzed the temporal responses of bilateral pSTS evoked by normal faces based on the additional data (Experiment 2), and the time courses basically remained the same as the previous ones (regardless of the task and face familiarity). We have added more discussion about the role of pSTS and its dynamics, especially in relation to the processing of facial expression, muscle articulations and motion.

Author response image 1

We also thank the reviewer for reminding us about the role of low-level features including color, and have added more discussion about their role in face processing.

3) Another point that needs further discussion is the role of internal versus external face features (Sinha et al., 2006), and context (Sinha, Science 2004; Martinez, 2019). These discussions are essential to frame the results of the present paper within existing models of face perception.

We agree that it is important to understand the role of internal versus external face features. Since we were going to obtain more experimental data during the revision, we made the effort to perform additional MEG experiments to specifically investigate the role of internal versus external face features and context (see #4 below). We have also added more discussion about them.

4) The conclusions of the study rest on the data from a single experiment, and further investigation of the putative effects of top-down feedback and predictive coding are not provided. A follow-up experiment that both replicates and extends the current findings would help strengthen the study.

We thank the editor and reviewer for pushing us to perform a follow-up experiment. We performed not just one but three follow-up experiments (one replication and two extensions), which indeed replicated and significantly extended the findings reported in the original version.

We collected more data for Experiment 2 (normal unfamiliar face vs Mooney face) to confirm the previous results and performed two additional experiments to extend our findings. The replication data and the new experiments are reported in the revised manuscript.

Replication: we collected data from 15 additional subjects using normal faces and Mooney faces. The results were consistent with previous ones with enhanced statistical power (see Results).

Extension 1: To further study the role of internal (eyes, nose, mouth) versus external (hair, chin, face outline) face features, we presented distorted face images (explicit internal facial features available but spatially misarranged without changing the face contour) to subjects and analyzed the data as before. Consistent with our hypothesis, the clear face components (even though misarranged) evoked strong responses in rOFA, without clear evidence of a late signal corresponding to prediction error, indicating that the spatial configuration of internal face features was not a prominent part of the prediction error from rFFA to rOFA. In this case, the processing sequence for the distorted faces would be similar to that elicited by normal faces.

Extension 2: In a new experiment, we also investigated the role of context in face processing by presenting three types of stimuli to subjects: (i) images of highly degraded faces with contextual body cues that imply the presence of faces, (ii) images of degraded faces and body cues arranged in an incorrect configuration that therefore does not imply the presence of faces, (iii) images of objects. Results showed that rOFA, rpFFA and raFFA are activated almost simultaneously at a late stage, implying a parallel contextual modulation of the core face-processing network. This result further emphasizes the importance of internal face features in driving the sequential OFA to FFA processing, and helps our understanding of the dynamics of contextual modulation in face perception.

5) The reported effects pass statistical significance but not by a large margin. Moreover, there can be concerns that MEG data varies considerably across participants and can lead to heterogeneity of variance, especially across time points. Shuffling of the data with randomized labels would provide a more rigorous approach to statistical analysis.

As described in #4 above, we collected data from 15 additional subjects for the Mooney face experiment (normal unfamiliar faces vs. Mooney faces). Combined with the previous data, nonparametric permutation tests were performed to check the significance level of the observed time difference between rOFA and rpFFA. The results are consistent with the previous ones, with enhanced statistical power (see Results).
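For readers unfamiliar with label shuffling, the kind of nonparametric test mentioned above can be sketched as follows. This is only an illustrative sketch, not the analysis code used in the study: it assumes one peak latency per subject and region, and it uses a paired sign-flip scheme, which is one common way to permute region labels within subjects. The function name `paired_permutation_test` and its parameters are hypothetical.

```python
import random
from statistics import mean


def paired_permutation_test(lat_a, lat_b, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on paired latencies (ms).

    lat_a, lat_b: per-subject latencies for two regions (hypothetical
    inputs, e.g. rOFA and rpFFA). Returns (mean difference, p-value).
    """
    if len(lat_a) != len(lat_b):
        raise ValueError("latency lists must be paired per subject")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(lat_a, lat_b)]
    observed = mean(diffs)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the two region labels are exchangeable
        # within a subject, so each difference may take either sign.
        shuffled = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(shuffled) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so p is never exactly zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value
```

Because the null distribution is built from the data themselves, this approach does not rely on equal variance across participants or time points, which is the concern raised in the comment.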

Reviewer #1:

[…] The results are clear-cut and the paper is in general well-written. I believe the present study, if in the end published, would be of interest to a broad readership including psychologists and neuroscientists. I only have a few comments that I wish the authors to address:

1) While recognizing faces from Mooney images would certainly rely heavily on top-down mechanisms, it is hard to rule out the involvement of top-down mechanisms when processing normal face pictures. Intuitively, for example, processing familiar faces would involve more top-down experience driven activity than processing unfamiliar faces. However, the present results seem to suggest no significant differences between processing famous and unfamiliar faces. How come?

This is a very valid point. This comment helped us to clarify that the difference between processing Mooney images and normal faces is not absolute. While top-down mechanisms are more dominant in the case of Mooney faces, they are certainly also involved, but to a lesser degree, in the processing of normal faces. With regard to the processing of familiar vs. unfamiliar faces, our data show that there was little difference between them. It is likely that familiarity plays a more important role in the more anterior and medial regions of the temporal cortex. We clarified our writing and discussed this issue in the revised manuscript.

2) The Discussion somewhat overlooks effects potentially driven by different tasks. As far as I understand, subjects performed different tasks for the Mooney face experiment and normal face versus object picture experiments.

We thank the reviewer for pointing this out. Yes, a category task (face or not) was used in the normal (familiar or unfamiliar) faces vs objects experiment, and a one-back task was used in the normal unfamiliar faces vs Mooney faces experiment. We had the opportunity to check the effects of task using the unfamiliar faces, since the same stimuli were used in the category task and the one-back task. Results show that there was no significant task effect on the timing of activation of the core face areas. We added more description of the different tasks in the Materials and methods section and also added some discussion in the Discussion section.

3) Given studies on the functional role of left FFA (e.g., Meng et al., 2012; Bi et al., 2014; Goold and Meng, 2017), I would be greatly interested in Results and Discussions regarding what the present data could reveal about dynamic relations between the left and right face processing core networks.

We agree that the dynamic relations between the left and right face networks are interesting. Our results include data from both left and right face networks, though it was not feasible to further separate the left FFA into the anterior and posterior regions. We have added more discussion about the differences between left and right face processing core networks.

4) Some justification would be helpful for using sliding time windows of 50 ms. One possibility is to add power spectrum analysis. In any case, power spectrum analysis might be helpful for revealing further fine-scale temporal dynamics of brain responses.

The 50 ms time window was selected based on a previous study (Ashrafulla et al., 2013) as a compromise between the temporal precision and the reliability of the causality analysis. In other words, there is a trade-off between temporal resolution (shorter is better) and accuracy of model fit (longer is better) when choosing the size of the time window. In addition, we did not consider a shorter time window because activity/power drops quickly beyond the β-band based on the power spectrum (see Materials and methods).

Author response image 2
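The window-length trade-off in the response above can be made concrete with a small sketch that enumerates sliding windows over a sampled time course. It is illustrative only: the sampling rate and the step between windows are not stated in this section, so `sfreq_hz` and `step_ms` are assumed parameters, and `sliding_windows` is a hypothetical helper rather than part of the authors' pipeline.

```python
def sliding_windows(n_samples, sfreq_hz, win_ms=50.0, step_ms=None):
    """Return (start, stop) sample-index pairs for sliding time windows.

    n_samples: length of the time course in samples.
    sfreq_hz:  sampling rate in Hz (assumed; not given in the text).
    win_ms:    window length in ms (50 ms, as in the response above).
    step_ms:   shift between windows in ms; None means one sample.
    """
    win = int(round(win_ms * sfreq_hz / 1000.0))
    step = 1 if step_ms is None else int(round(step_ms * sfreq_hz / 1000.0))
    if win < 1 or step < 1:
        raise ValueError("window and step must span at least one sample")
    # Each window covers samples [start, stop); windows that would run
    # past the end of the recording are dropped.
    return [(s, s + win) for s in range(0, n_samples - win + 1, step)]
```

A shorter `win_ms` localizes an effect more precisely in time but leaves fewer samples per window for fitting a model, which is the compromise the 50 ms choice reflects.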

Reviewer #3:

[…] The findings are of some interest but there are some major concerns. First, the discussion of previous work is rather limited and does not cite many related studies that have characterized the timing of face processing in the FFA and OFA. It has been known for a while that the OFA responds at an earlier latency than the FFA (e.g., Liu et al., 2002), and that certain stimulus manipulations, such as face inversion and contrast reversal, lead to delayed responses to faces (Bentin et al., 1996; Rossion et al., 2000; Rossion et al., 2012). Previous fMRI work has shown that difficult to perceive Mooney faces can lead to response delays on the order of several seconds (McKeeff and Tong, 2007). More recent techniques have allowed research groups to provide more refined estimates of the timing of neural responses, such as the fusion of fMRI-MEG analyzed using representational similarity analysis (e.g., Cichy et al., 2014). Periodic visual stimulation has also been used to characterize the timing of neural responses obtained with EEG/MEG by several research groups (e.g., Rossion et al., 2012, 2014; Norcia et al., 2015), and this approach has been successfully applied to characterize top-down effects of feedback during face processing (e.g., Baldauf and Desimone, 2014). The empirical and conceptual advances made in the current study need to be more clearly articulated with respect to previous work, and a clear argument for the specific contributions of this study is needed.

We appreciate and agree with this suggestion. The dynamics of face-induced neural activation in FFA and OFA have been studied for a long time with various techniques. However, previous results are inconsistent and individually often lack either spatial precision (e.g., sensor-level EEG/MEG analysis) or temporal precision (e.g., fMRI data). Our results, obtained with combined fMRI and MEG measures, provide detailed and novel timing information about the core face network. For example, the relatively large temporal gap between the right anterior and posterior FFA was not reported in previous studies. Furthermore, our results showed that the temporal relationships between OFA and FFA depend on the internal facial features as well as the context of the visual input, which helps to explain how bottom-up and top-down processing together contribute to face perception.

Many previous studies used the N170/M170 component as the index of face processing in the ventral occipitotemporal cortex. However, the delayed N170/M170 response caused by certain stimulus manipulations (e.g., face inversion, Mooney transformation) is a relatively crude measure of face processing because of the difficulty in attributing the sources of the delay. On the other hand, fMRI measures alone, which show a delayed FFA response to Mooney faces that were initially not recognized as faces, simply reflect the time it took subjects to recognize difficult Mooney faces, rather than the real-time dynamics of Mooney face processing. In contrast, our results showed that when the face features were confounded with other shadows, the top-down rpFFA to rOFA projection became more dominant.

In the revised manuscript, we discussed the different techniques used to investigate the timing of face responses and the top-down modulation in face processing reported in previous studies (Discussion section).

Another concern is that the conclusions of the study rest on the data from a single experiment, and further investigation of the putative effects of top-down feedback and predictive coding are not provided. Reproducibility is a serious concern in many fields of science, especially psychology and also neuroscience. A follow-up experiment that both replicates and extends the current findings would help strengthen the study. The reported effects pass statistical significance but not by a large margin. Moreover, there can be concerns that MEG data varies considerably across participants and can lead to heterogeneity of variance, especially across time points. Shuffling of the data with randomized labels would provide a more rigorous approach to statistical analysis.

We thank the editor and reviewer for pushing us to perform a follow-up experiment. We performed not just one but three follow-up experiments (one replication and two extensions), which indeed replicated and significantly extended the findings reported in the original version.

We collected more data for Experiment 2 (normal unfamiliar face vs Mooney face) to confirm the previous results and performed two additional experiments to extend our findings.

The replication data and the new experiments are reported in the revised manuscript.

Replication: we collected data from 15 additional subjects using normal faces and Mooney faces. The results were consistent with previous ones with enhanced statistical power (see Results).

Extension 1: To further study the role of internal (eyes, nose, mouth) versus external (hair, chin, face outline) face features, we presented distorted face images (explicit internal facial features available but spatially misarranged without changing the face contour) to subjects and analyzed the data as before. Consistent with our hypothesis, the clear face components (even though misarranged) evoked strong responses in rOFA, without clear evidence of a late signal corresponding to prediction error, indicating that the spatial configuration of internal face features was not a prominent part of the prediction error from rFFA to rOFA. In this case, the processing sequence for the distorted faces would be similar to that elicited by normal faces.

Extension 2: In a new experiment, we also investigated the role of context in face processing by presenting three types of stimuli to subjects: (i) images of highly degraded faces with contextual body cues that imply the presence of faces, (ii) images of degraded faces and body cues arranged in an incorrect configuration that therefore does not imply the presence of faces, (iii) images of objects. Results showed that rOFA, rpFFA and raFFA are activated almost simultaneously at a late stage, implying a parallel contextual modulation of the core face-processing network. This result further emphasizes the importance of internal face features in driving the sequential OFA to FFA processing, and helps our understanding of the dynamics of contextual modulation in face perception.

As described in #4 above, we collected data from 15 additional subjects for the Mooney face experiment (normal unfamiliar faces vs. Mooney faces). Combined with the previous data, nonparametric permutation tests were performed to check the significance level of the observed time difference between rOFA and rpFFA. The results are consistent with the previous ones, with enhanced statistical power (see Results).

Reviewer #4:

[…] A point that is lacking significantly is the role of pSTS. We know pSTS is mostly involved in the analysis of facial muscle articulations (also called action units, AUs) and the interpretation of facial expressions and emotion, see Srinivasan et al., 2016, and Martinez, 2017. Also relevant is the role of low-level image features (Weibert et al., 2018), which is also missing from the Discussion; and, the role of color perception (Yip and Sinha, 2002; Benitez-Quiroz et al., 2018).

The temporal responses of bilateral pSTS are broader (multi-peaked) and showed lower signal-to-noise than the ventral face-selective areas (Figure 2 and Figure 2—figure supplement 1). To increase our confidence about the pSTS time course, we analyzed the temporal responses of bilateral pSTS evoked by normal faces based on the additional data (Experiment 2), and the time courses basically remained the same as the previous ones (regardless of the task and face familiarity). We have added more discussion about the role of pSTS and its dynamics, especially in relation to the processing of facial expression, muscle articulations and motion.

We also thank the reviewer for reminding us about the role of low-level features including color, and have added more discussion about their role in face processing.

Another point that needs further discussion is the role of internal versus external face features (Sinha et al., 2006), and context (Sinha, Science 2004; Martinez, 2019).

These discussions are essential to frame the results of the present paper within existing models of face perception. With appropriate changes, this could be a strong paper.

We agree that it is important to understand the role of internal versus external face features. Since we were going to obtain more experimental data during the revision, we made the effort to perform additional MEG experiments to specifically investigate the role of internal versus external face features and context (see response to editor’s #4). We have also added more discussion about them.

https://doi.org/10.7554/eLife.48764.sa2

Article and author information

Author details

  1. Xiaoxu Fan

    1. State Key Laboratory of Brain and Cognitive Science, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
    2. University of Chinese Academy of Sciences, Beijing, China
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-8115-8621
  2. Fan Wang

    1. State Key Laboratory of Brain and Cognitive Science, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
    2. University of Chinese Academy of Sciences, Beijing, China
    Contribution
    Methodology, Data acquisition, Data analysis, Funding acquisition
    Competing interests
    No competing interests declared
  3. Hanyu Shao

    State Key Laboratory of Brain and Cognitive Science, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
    Contribution
    Methodology, Data acquisition
    Competing interests
    No competing interests declared
  4. Peng Zhang

    1. State Key Laboratory of Brain and Cognitive Science, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
    2. University of Chinese Academy of Sciences, Beijing, China
    Contribution
    Methodology, Data analysis, Funding acquisition
    Competing interests
    No competing interests declared
  5. Sheng He

    1. State Key Laboratory of Brain and Cognitive Science, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
    2. University of Chinese Academy of Sciences, Beijing, China
    3. Department of Psychology, University of Minnesota, Minneapolis, United States
    Contribution
    Conceptualization, Investigation, Methodology, Writing, Funding acquisition, Project administration
    For correspondence
    sheng@umn.edu
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0001-5547-923X

Funding

Beijing Science and Technology Project (Z181100001518002)

  • Sheng He

Ministry of Science and Technology of the People's Republic of China (2015CB351701)

  • Fan Wang

Bureau of International Cooperation, Chinese Academy of Sciences (153311KYSB20160030)

  • Peng Zhang

Beijing Science and Technology Project (Z171100000117003)

  • Sheng He

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Daniel Kersten for helpful comments on the manuscript and Ling Liu for her help in MEG data analysis. This work was supported by the Beijing Science and Technology Project (Z181100001518002, Z171100000117003), the Ministry of Science and Technology of China grants (2015CB351701) and Bureau of International Cooperation, Chinese Academy of Sciences (153311KYSB20160030).

Ethics

Human subjects: All subjects (age range 19-31) provided written informed consent and consent to publish before the experiments, and experimental protocols were approved by the Institutional Review Board of the Institute of Biophysics, Chinese Academy of Sciences (# 2017-IRB-004). The image used in Figure 3 is a photograph of one of the authors, and a Consent to Publish form was obtained.

Senior Editor

  1. Joshua I Gold, University of Pennsylvania, United States

Reviewing Editor

  1. Ming Meng, South China Normal University, China

Reviewer

  1. Ming Meng, South China Normal University, China

Publication history

  1. Received: May 24, 2019
  2. Accepted: January 10, 2020
  3. Accepted Manuscript published: January 14, 2020 (version 1)
  4. Version of Record published: February 4, 2020 (version 2)

Copyright

© 2020, Fan et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


