Bottom-up and top-down computations in word- and face-selective cortex

Abstract
eLife digest
Introduction
Results
Discussion
Materials and methods
References
Article and author information
Metrics

Abstract

The ability to read a page of text or recognize a person's face depends on category-selective visual regions in ventral temporal cortex (VTC). To understand how these regions mediate word and face recognition, it is necessary to characterize how stimuli are represented and how this representation is used in the execution of a cognitive task. Here, we show that the response of a category-selective region in VTC can be computed as the degree to which the low-level properties of the stimulus match a category template. Moreover, we show that during execution of a task, the bottom-up representation is scaled by the intraparietal sulcus (IPS), and that the level of IPS engagement reflects the cognitive demands of the task. These results provide an account of neural processing in VTC in the form of a model that addresses both bottom-up and top-down effects and quantitatively predicts VTC responses.

https://doi.org/10.7554/eLife.22341.001

eLife digest

As your eyes scan this page, your visual system performs a series of computations that allow you to derive meaning from the printed words. The visual system solves this task with such apparent ease that you may never have thought about the challenges that your brain must overcome for you to read a page of text. The brain must overcome similar challenges to enable you to recognize the faces of your friends.

Two factors affect how the neurons in the visual system respond to what you are looking at: the physical features of the object and your cognitive state (for example, your knowledge, past experiences, and the cognitive demands of the task at hand). To figure out exactly how these factors influence the responses of the neurons, Kay and Yeatman used functional magnetic resonance imaging to scan the brains of human volunteers as they viewed different images (some of which were of faces or words). The volunteers had to perform various tasks while viewing the images. These tasks included focusing their attention on a small dot, categorizing the image, and stating whether the image had previously been shown.

From the brain imaging data, Kay and Yeatman developed a model of the brain circuits that enable faces and words to be recognized. The model separately characterizes the influence of physical features and the influence of cognitive state, and describes several different types of processing: how the brain represents what is seen, how it makes a decision about how to respond, and how it changes its own activity when carrying out the decision. The model has been made freely available online so that other researchers can reproduce and build upon Kay and Yeatman’s findings.

The current model is not perfect, and it does not describe neural activity in fine detail. However, by obtaining new experimental measurements, the model could be systematically improved.

https://doi.org/10.7554/eLife.22341.002

Introduction

How does visual cortex work? One approach to answering this question consists in building functional models that characterize the computations that are implemented by neurons and their circuitry (Hubel and Wiesel, 1963; Heeger et al., 1996). This approach has been fruitful for the front end of the visual system, where relatively simple image computations have been shown to characterize the spiking activity of neurons in the retina, thalamus, and V1 (Carandini et al., 2005; Wu et al., 2006). Based on this pioneering work in electrophysiology, researchers have extended the modeling approach to characterize responses in human visual cortex, as measured by functional magnetic resonance imaging (fMRI) (Wandell, 1999; Dumoulin and Wandell, 2008; Kay et al., 2008).

Models of early visual processing have been able to offer accurate explanations of low-level perceptual functions such as contrast detection (Ress et al., 2000; Ress and Heeger, 2003) and orientation discrimination (Bejjanki et al., 2011). However, these models are insufficient to explain high-level perceptual functions such as the ability to read a page of text or recognize a face. These abilities are believed to depend on category-selective regions in ventral temporal cortex (VTC), but the computations that give rise to category-selective responses are poorly understood.

The goal of the present study is to develop a model that predicts fMRI responses in high-level visual cortex of human observers while they perform different cognitive tasks on a wide range of images. We seek a model that is fully computable—that is, a model that can operate on any arbitrary visual image and quantitatively predict BOLD responses and behavior (Kay et al., 2008; Huth et al., 2012; Khaligh-Razavi and Kriegeskorte, 2014; Yamins et al., 2014). Achieving this goal requires four innovations: First, we need to develop a forward model that characterizes the relationship between visual inputs and the BOLD response in word- and face-selective cortex. Second, we need to dissociate bottom-up stimulus-driven effects from modulation by top-down cognitive processes and characterize how these processes alter the stimulus representation. Third, we need to localize the source of the top-down effects and integrate bottom-up and top-down computations into a single consolidated model. Finally, the neural computations should be linked to the measured behavior of the visual observer. In this study, we make progress on these four innovations and develop a model that characterizes bottom-up and top-down computations in word- and face-selective cortex.

Results

VTC responses depend on both stimulus properties and cognitive task

Ventral temporal cortex (VTC) is divided into a mosaic of high-level visual regions that respond selectively to specific image categories, and are believed to play an essential role in object perception (Kanwisher, 2010; Dehaene and Cohen, 2011). We focus on two specific VTC regions, the visual word form area (VWFA), which selectively responds to words (Cohen et al., 2000, 2002; Wandell et al., 2012), and the fusiform face area (FFA), which selectively responds to faces (Kanwisher et al., 1997; Grill-Spector and Weiner, 2014).

We measured blood oxygenation level dependent (BOLD) responses to a set of carefully controlled images while manipulating the cognitive task that the subjects performed on the stimuli. The first task was designed to minimize the influence of cognitive processes on sensory processing of the stimulus. Subjects performed a demanding perceptual task on a small dot (0.12° × 0.12°) presented at fixation. In this fixation task, the presented stimuli are irrelevant to the subject, and we interpret evoked activity as reflecting primarily the intrinsic, bottom-up response from VTC. We acknowledge that the fixation task may not perfectly isolate bottom-up responses. For example, high-contrast stimuli may automatically attract attention. Moreover, there are other potential interpretations of the fixation task: for example, allocating attention to the small fixation dot might engage active suppression of responses to the presented stimuli.

To a first approximation, much of the variance in the bottom-up fixation responses from VWFA and FFA is explained by the category of stimulus (Figure 1d, red lines). However, we find that responses are not invariant to low-level properties of the stimulus: both image contrast and phase coherence modulate response amplitudes. For example, the response to a word in VWFA is 2.4 times stronger when the word is presented at 100% contrast as compared to 3% contrast. These bottom-up effects (see also Rainer et al., 2001; Avidan et al., 2002; Yue et al., 2011; Nasr et al., 2014) may be somewhat surprising given that theories of word recognition generally posit that the VWFA response is invariant to low-level features (Dehaene and Cohen, 2007, 2011; Price and Devlin, 2011). In fact, it is currently debated whether the VWFA should be considered a visual area or a ‘meta modal’ language region (Reich et al., 2011; Striem-Amit et al., 2012). Our measurements indicate that when top-down signals are minimized, word- and face-selective cortex is sensitive to low-level image properties, and that an accurate model of the computations performed by these regions must consider not only the stimulus category but also low-level features of the stimulus.

Figure 1 with 1 supplement see all

Download asset Open asset

VTC responses depend on both stimulus properties and cognitive task.

(a) *Stimuli*. Stimuli included faces, words, and noise patterns presented at different contrasts and phase-coherence levels, as well as full-contrast polygons, checkerboards, and houses. (b) *Trial design*. Each trial consisted of four images drawn from the same stimulus type. (c) *Tasks*. On a given trial, subjects performed one of three tasks. (d) *Evoked responses in VWFA (top) and FFA (bottom) for different stimuli and tasks*. Color of x-axis label indicates the perceived stimulus category as reported by the subjects. Error bars indicate bootstrapped 68% CIs.

https://doi.org/10.7554/eLife.22341.003

We also measured VTC responses while subjects performed a categorization task, in which the subject reports the perceived category of the stimulus, and a one-back task, in which the subject detects consecutive repetitions of stimulus frames. Despite the presentation of identical stimuli across the three tasks, there are substantial changes in evoked VTC responses. Responses are larger for the categorization (Figure 1d, green lines) and one-back tasks (Figure 1d, blue lines) compared to the fixation task (Figure 1d, red lines), and we interpret these response increases as reflecting top-down modulation. In some cases, the top-down modulation is even larger than the modulation achieved by manipulation of the stimulus. For example, the VWFA response to 3%-contrast words during the one-back task exceeds the response to 100%-contrast words. Note that the task effects cannot be explained simply by differences in spatial attention: the one-back task produces substantially larger responses than the categorization task despite the fact that both tasks require the locus of spatial attention to be on the stimulus. Task effects in lower-level areas exist but are smaller in size (Figure 1—figure supplement 1).

A potential explanation of the top-down modulation is differences in task difficulty (Ress et al., 2000). For example, it is presumably more difficult to perceive low-contrast stimuli than high-contrast stimuli, and this may explain why there is a large response enhancement for low- but not high-contrast stimuli (see VWFA contrast-response function for word stimuli in Figure 1d). Later in this paper, we provide a computational mechanism that could underlie the psychological concept of task difficulty.

In summary, our measurements indicate that VTC responses cannot be interpreted without specifying the cognitive state of the observer. A complete model of the computations performed by VWFA and FFA must consider the cognitive task in addition to stimulus properties.

Model of bottom-up computations in VTC

Before addressing the influence of top-down factors, we first develop a model of bottom-up responses in VWFA and FFA. Although the field has long understood that stimulus category is a good predictor of evoked responses (Kanwisher et al., 1997; Kriegeskorte et al., 2008; Grill-Spector and Weiner, 2014), we do not yet have a computational explanation of this phenomenon. In other words, although we are able to use our own visual systems to assign a label such as ‘word’ or ‘face’ to describe the data, we have not yet identified the operations that enable our visual systems to derive these labels in the first place. An additional limitation of our conceptual understanding is that it fails to account for the sensitivity of VWFA and FFA to low-level image properties. We therefore ask: Is it possible to develop a quantitative characterization of the bottom-up computations that can reproduce observed stimulus selectivity in human VTC?

Extending an existing computational model of fMRI responses in the visual system (Kay et al., 2008, 2013b, 2013c), we conceive of a model involving two stages of image computations (Figure 2a). The first stage consists of a set of local oriented filters, akin to what has been used to model physiological responses in V1 (Jones and Palmer, 1987; Carandini et al., 2005). The second stage consists of a normalized dot product applied to the outputs of the first stage. This dot product computes how well a given stimulus matches a category template (for example, a word template for VWFA, a face template for FFA). We construct category templates directly from the stimulus set used in the experiment; in a later section we explore how well this approach generalizes. The present model, termed the Template model, is almost certainly an oversimplification of the complex nonlinear processing performed in VTC. Nevertheless, the model is theoretically motivated, consistent with hierarchical theories of visual processing (Fukushima, 1980; Heeger et al., 1996; Serre et al., 2007; DiCarlo et al., 2012; Rolls, 2012), and provides a useful starting point for characterizing the computations that underlie word- and face-selectivity. Moreover, unlike recently popular deep neural network models that also involve hierarchical processing (Khaligh-Razavi and Kriegeskorte, 2014; Yamins et al., 2014; Güçlü and van Gerven, 2015), the model we propose is parsimonious with only three free parameters, and is therefore straightforward to fit and interpret (see Figure 2a and Materials and methods).

Figure 2 with 1 supplement see all

Download asset Open asset

Model of bottom-up computations in VTC.

(a) *Model architecture.* The predicted response of the Template model is given by a series of image computations (see Materials and methods). (b) *Cross-validation performance.* Black bars indicate bottom-up stimulus-driven responses measured during the fixation task, dark lines and dark dots indicate model predictions (leave-one-stimulus-out cross-validation), and light lines and light dots indicate model fits (no cross-validation). Scatter plots in the inset compare model predictions against the data. The Template model is compared to the Category model which simply predicts a fixed response level for stimuli from the preferred stimulus category and a different response level for all other stimuli (the slight decrease in response as a function of contrast is a result of the cross-validation process). (c) *Comparison of performance against control models.* Bars indicate leave-one-stimulus-out cross-validation performance. Error bars indicate 68% CIs, obtained by bootstrapping (resampling subjects with replacement). Solid horizontal lines indicate the noise ceiling, that is, the maximum possible performance given measurement variability in the data. Dotted horizontal lines indicate the cross-validation performance of a model that predicts the same response level for each data point (this corresponds to R² = 0 in the conventional definition of R² where variance is computed relative to the mean). The performance of the Template model degrades if the second stage of nonlinearities is omitted (Template model (only subtractive normalization)) or if the first stage of the model involving V1-like filtering is omitted (Template model (omit first stage)). The plot also shows that the precise configuration of the template is important for achieving high model performance (Template model (non-selective, mixed, random templates)). (d) *Performance as a function of spatial frequency tuning.* Here we manipulate the spatial frequency tuning of the filters in the Template model (while fixing spatial frequency bandwidth at one octave). The Template model uses a single set of filters at a spatial frequency tuning of 4 cycles/degree.

https://doi.org/10.7554/eLife.22341.005

Applying the Template model to responses measured during the fixation task, we find that the Template model accurately predicts a large amount of variance in the responses of VWFA and FFA (Figure 2b). The model outperforms a phenomenological model, termed the Category model, that posits that perceived stimulus category is sufficient to predict the response of category-selective regions. Notably, the Template model is able to predict the response to non-preferred stimulus categories in each ROI. This suggests that responses to non-preferred stimuli are meaningful and the result of a well-defined computation performed by the visual system (Haxby et al., 2001). The model also outperforms simplified versions of the Template model that include only one of the two processing stages, as well as versions of the Template model in which the category template lacks tuning (non-selective template), is equally weighted between words and faces (mixed template), or is constructed randomly (random template) (Figure 2c).

The experiment we have conducted explores a limited range of stimuli. To further assess how well the Template model generalizes, we collected an additional dataset that includes 92 images taken from a previous study of object representation (Kriegeskorte et al., 2008). In its original instantiation (Figure 2a), the Template model uses category templates tailored to the stimuli in the main experiment, and we find that this instantiation of the Template model does not generalize well to the larger stimulus set (Figure 2—figure supplement 1b). However, by implementing a simple model extension in which we use a data-driven approach to estimate category templates, we find that the Template model achieves a reasonable level of accuracy on the new stimulus set (Figure 2—figure supplement 1d). This finding validates the basic architecture of the Template model, demonstrates how the Template model might be extended to account for increasingly large ranges of measurements, and provides a promising method to model response properties of other high-level visual regions not investigated here (for example, place- and limb-selective cortex).

The Template model advances us towards a computational understanding of VTC by demonstrating that BOLD responses in VTC can be predicted based on a template-matching operation on incoming visual inputs filtered by early visual cortex. The present results indicate that although high-level representations are not identical to low-level properties, they are built from, and fundamentally tied to, low-level properties through a series of linear and nonlinear operations. This conclusion is consistent with classic hierarchical theories of visual cortex (Fukushima, 1980; Heeger et al., 1996; Serre et al., 2007; DiCarlo et al., 2012; Rolls, 2012) and recent evidence that visual features may explain semantic representations found in high-level visual cortex (Jozwik et al., 2016). Our model can be viewed as a potential mechanism for how semantic tuning properties emerge in visual cortex (Huth et al., 2012). Our results indicate when studying high-level sensory representations in the brain, a precise characterization of the stimulus still matters.

Top-down modulation acts as a stimulus-specific scaling

While the Template model explains bottom-up responses of VWFA and FFA as indexed by the fixation task, it does not explain why responses are higher in these areas during the categorization and one-back tasks (see Figure 1d). This is simply because the stimuli are identical across the three tasks and the response of the Template model, like that of many computational models of visual processing, is solely a function of the stimulus. Before we can design a model to capture the top-down effects, we must first understand exactly how top-down signals shape the VTC response.

By visualizing VTC responses as points in a multi-dimensional neural space with VWFA, FFA, and hV4 BOLD response amplitudes as the axes, we see that the responses to words and faces lie on specific manifolds, appearing as ‘arms’ that emanate from the origin (Figure 3). Importantly, we observe that the categorization and one-back tasks act as a scaling mechanism on the representation observed during the fixation task. The scaling mechanism moves the representation of each stimulus along the arms and away from the origin. Moreover, the amount of scaling is not constant across stimuli but is stimulus-specific, and this is most evident when considering the lowest contrast stimuli (Figure 3, black dots).

Figure 3

Download asset Open asset

Top-down stimulus-specific scaling of VTC representation.

(a) *Responses plotted in multi-dimensional neural space*. Each dot indicates ROI (VWFA, FFA) responses to a stimulus. In each plot, the black line indicates a linear decision boundary separating words and faces (nearest-centroid classifier, angular distance). (b) *Schematics of potential top-down mechanisms* (these models are formally evaluated in Figure 5c; see Materials and methods section ‘IPS-scaling model’ for details). (c) *Categorization and one-back tasks produce stimulus-specific scaling*. Arrows indicate the change in representation compared to the fixation task. (d) *Scaling improves readout*. Each data point indicates the signed Euclidean distance between the word-face decision boundary (as determined from the one-back task) and the neural response to a single stimulus. Lines join data points that correspond to the same stimulus. The scaling observed during the categorization and one-back tasks moves responses away from the decision boundary, thereby improving signal-to-noise ratio. (e) *Separation of other stimulus categories*. Including hV4 as a third dimension reveals that stimuli categorized as neither words nor faces manifest as a third ‘arm’ that emanates from the origin. Although not reported to be a word by the subjects, the polygon stimulus behaves similarly to word stimuli.

https://doi.org/10.7554/eLife.22341.007

The visualization also shows that substantial responses to non-preferred categories are present in each ROI (for example, faces in VWFA, words in FFA) and that these responses are scaled during the stimulus-directed tasks. Thus, not only is information regarding non-preferred categories present in each ROI, but this information is actively modulated when subjects perform a perceptual task on those categories. These observations support the view that the brain uses a distributed strategy for perceptual processing and that category-selective regions are components of a more general network of regions that coordinate to extract visual information (Haxby et al., 2001; Cox and Savoy, 2003). An alternative scheme, more in line with a modular view of perceptual processing (Kanwisher and Wojciulik, 2000; Baldauf and Desimone, 2014), is area-specific enhancement, in which the representation of a stimulus is enhanced only in the region that is selective for that stimulus (for example, enhancement of words only in VWFA, enhancement of faces only in FFA). This scheme is not supported by our measurements (Figures 3b and 3c; formal model evaluation is performed in a later section). Rather, response scaling occurs even for non-preferred stimulus categories, and the amount of scaling varies as a function of stimulus properties such as image contrast.

A simple interpretation of the scaling effects is that they serve to increase signal-to-noise ratio in visually evoked responses in VTC (Brouwer and Heeger, 2013). For example, assuming that one use of the stimulus representation in VTC is to discriminate whether the presented stimulus is a word or face (or, more generally, identify the category of the stimulus [DiCarlo et al., 2012]), the scaling induced by the stimulus-directed tasks serves to increase the distance of neural responses from a linear decision boundary that separates words and faces (Figure 3d). Interestingly, the categorization and one-back tasks appear to act via the same scaling mechanism. The stronger scaling observed for the one-back task might be a consequence of increased amplitude or duration of neural activity. These results suggest that, at least for the perceptual tasks sampled here and the spatial scale of neural activity measured in this study, top-down cognitive processes do not impart additional tuning or selectivity but serve to amplify the selectivity that is already computed by visual cortex.

IPS is the source of top-down modulation to VTC

To design a plausible model that can predict top-down effects, we next turn to identifying the neural circuitry that generates task modulations in VTC. There are two candidate mechanisms. The first is that sensitivity to task is locally generated from the neuronal architecture of VTC itself. We explore an alternative hypothesis whereby top-down modulation is induced by input from another brain region that is sensitive to task demands. To identify this region, we perform a connectivity analysis in which we first subtract the bottom-up signal in VTC, as given by responses measured during the fixation task, from responses measured during the categorization and one-back tasks. We then correlate these residuals, which isolate the top-down signal, against the responses of every cortical location.

Applying this connectivity analysis to our data, we find that responses in the intraparietal sulcus (IPS) predict the top-down enhancement of VTC responses (Figure 4b) better than responses in any other region of cortex. As a control, if we omit the subtraction step and simply correlate raw VTC responses with the responses of different cortical locations, we find that the correlation is instead strongest with a range of areas spanning occipital cortex (Figure 4a). This indicates that the VTC response is a mixture of bottom-up and top-down effects and that the top-down influence from the IPS becomes clear only when bottom-up effects are removed. Comparing our results to a publicly available atlas (Wang et al., 2015), we estimate that the source of top-down modulation is localized to the IPS-0 and IPS-1 subdivisions of the IPS (see also Figure 4—figure supplement 1).

Figure 4 with 1 supplement see all

Download asset Open asset

IPS is the source of top-down modulation to VTC.

(a) *Correlation with raw VTC response*. This map depicts the correlation between the VTC response observed during the categorization and one-back tasks with the response at each cortical location (inset shows an unsmoothed and unthresholded map). Positive correlations are broadly distributed across occipital cortex. Results are shown for subjects with whole-brain coverage (n = 3); results for other subjects with partial-brain coverage (n = 6) are shown in Figure 4—figure supplement 1. (b) *Correlation with top-down component of VTC response*. After removing bottom-up responses (fixation task), the correlation is spatially localized to a hotspot in IPS-0/1. (c) *Tractography using diffusion MRI*. We find that the vertical occipital fasciculus (Yeatman et al., 2014) connects VWFA and FFA to the IPS hotspot in each subject for which diffusion data were collected (n = 8) (rendering shows a representative subject).

https://doi.org/10.7554/eLife.22341.008

Previous research has identified IPS as playing a key role in controlling spatial attention (Saalmann et al., 2007; Lauritzen et al., 2009). Our results extend these findings by showing that, despite the fact that spatial attention is always directed towards the foveal stimulus during the categorization and one-back tasks, the amount of modulation from the IPS is flexible and varies depending on properties of the stimulus and demands of the task. For example, during the categorization task, the observed enhancement for low-contrast stimuli is much larger than that for high-contrast stimuli. This mechanism could explain the finding that difficult tasks enhance visual responses (Ress et al., 2000).

The direct influence of IPS on neural responses in VTC is consistent with anatomical measurements demonstrating the existence of a large white-matter pathway connecting dorsal and ventral visual cortex, called the vertical occipital fasciculus (VOF) (Yeatman et al., 2013, 2014; Takemura et al., 2016). Using diffusion-weighted MRI and tractography (data acquired in 8 of 9 subjects), we show that the VOF specifically connects the VWFA and FFA with the functionally identified peak region in the IPS (Figure 4c). The VWFA falls within the ventral terminations of the VOF for seven subjects and, for the eighth, the VWFA is 2.7 mm anterior to the VOF, well within the margin of error for tractography (Jeurissen et al., 2011). The FFA falls within the ventral terminations of the VOF for all eight subjects. These results provide an elegant example of how anatomy subserves function, and sets the stage for a circuit-level computational model that, guided by anatomical constraints, characterizes the computations that emerge from interactions between multiple brain regions.

Model of top-down computations in VTC

The previous two sections provide critical insights that set the stage for building a quantitative model that predicts top-down effects in VTC. Building upon the observation that top-down modulation acts as a scaling mechanism on responses in VWFA and FFA (see Figure 3) and the observation that top-down effects are correlated with the IPS signal (see Figure 4), we propose that the magnitude of the IPS response to a stimulus indicates the amount of top-down scaling that is applied to bottom-up sensory responses in VTC (Figure 5a). We implement this model, termed the IPS-scaling model, using response magnitudes extracted from a broad anatomical mask of the IPS. This strategy helps avoids the overfitting that might ensue from a more specific voxel-selection procedure tailored to the fine-scale and potentially idiosyncratic pattern of results from the connectivity analysis. For example, if we were to select the single cortical location in the IPS that best correlates with the top-down modulation of VTC, this would make voxel selection a critical part of the model and render the modeling analysis circular (Kriegeskorte et al., 2009). Nevertheless, the selection procedure is not completely independent, so the modeling results should not be viewed as providing independent evidence for the involvement of the IPS.

Figure 5

Download asset Open asset

Model of top-down computations in VTC.

(a) *Model architecture.* The predicted response during the stimulus-directed tasks (categorization task, one-back task) is given by scaling the bottom-up response, with the amount of scaling proportional to the IPS signal. (b) *Cross-validation performance.* Same format as Figure 2b. The arrows highlight an example of how the bottom-up response (red arrow) is multiplied by the IPS signal (green arrow) to produce the predicted response (blue arrow). (c) *Comparison of performance against alternative models.* Same format as Figure 2c (some error bars do not include the bar height; this is a consequence of the bootstrap procedure). Although the Additive and Scaling models perform well, note that these are *ad hoc*, phenomenological models. For instance, the Scaling model (task-specific) posits separate parameters for the amount of scaling under the categorization and one-back tasks. However, such a model does not explain *why* there is a different amount of scaling, whereas the IPS-scaling model provides such an explanation.

https://doi.org/10.7554/eLife.22341.010

We find that the IPS-scaling model accurately characterizes the observed data (Figure 5b). For example, notice that the FFA response to faces increases gradually for each contrast increment during the fixation task (relatively unsaturated contrast-response function, red arrow). When subjects perform the one-back task, we observe a U-shaped contrast-response function in IPS (green arrow); multiplication of the two functions predicts a contrast-response function that is highly saturated and accurately matches the observed contrast-response function in FFA during the one-back task (blue arrow).

Importantly, the IPS-scaling model uses a single set of scale and offset parameters on the IPS response and accurately predicts scaling of VTC responses across the categorization and one-back tasks (Figure 5b, top plot). This finding suggests that the scaling of VTC by IPS is a general mechanism supporting perception and is independent of the specific cognitive task performed by the observer. Furthermore, the scale and offset parameters that are estimated from the data show that when IPS exhibits close to zero evoked activity (for example, FACE at 100%-contrast; see Figure 1—figure supplement 1), the corresponding scaling factor is close to one. This has a sensible interpretation: when IPS is inactive, we observe only the bottom-up response in VTC and no top-down modulation.

We assessed the cross-validation performance of the IPS-scaling model in comparison to several alternative models of top-down modulation (including those schematized earlier in Figure 3b). In line with earlier observations (Figure 3b and c), we find that a model positing enhancement for only the preferred stimulus category of each area (Area-specific enhancement model) does not optimally describe the data. We find that a phenomenological scaling model (Scaling model (task-specific)) outperforms a phenomenological additive model (Additive model (task-specific)), confirming earlier observations that the top-down modulation is a scaling effect (Figure 5c). This conclusion is further supported by the higher performance observed when the IPS interacts with VTC multiplicatively (IPS-scaling model) compared to when it interacts additively (IPS-additive model). Finally, we find that the performance of the IPS-scaling model degrades if the IPS input into the model is shuffled across conditions (IPS-scaling model (shuffle, shuffle within task)), confirming that top-down modulation from the IPS is dependent on the stimulus and task.

Is the IPS the only region that induces top-down modulation of VTC? Inspection of the connectivity results (see Figure 4b) reveals that the top-down residuals in VTC are correlated, to a lesser extent, with the responses of other regions. These weaker correlations might be incidental, or might capture other important signals. Given that the IPS-scaling model accounts for nearly all of the variance induced by top-down modulation of VTC (see Figure 5b and c), we suggest that it is sufficient to consider only the IPS for the current set of measurements. However, future measurements that employ new stimulus manipulations and other cognitive tasks may reveal the role of a more extensive brain network. The IPS-scaling model can be extended to account for new measurements by systematically parameterizing the connectivity with additional brain regions. For example, some models of reading posit that language-related regions can directly influence the VWFA (Twomey et al., 2011), suggesting that to account for measurements made during a more naturalistic reading task, it may be necessary to include Broca’s area in the model.

Model of perceptual decision-making in IPS

Although informative, the finding that IPS provides top-down stimulus-specific scaling of VTC is an incomplete explanation, as the burden of explaining the top-down effects is simply shifted to the IPS. We are thus left wondering: is it possible to explain the response profile of the IPS? In particular, can we explain why the IPS is more active for certain stimuli compared to others? Answering these questions will provide a critical link between cognitive state and IPS activity.

Inspired by previous research on perceptual decision-making (Shadlen and Newsome, 2001; Heekeren et al., 2004; Gold and Shadlen, 2007; Kayser et al., 2010a), we implement a Drift diffusion model that attempts to account for IPS responses measured during the categorization task (Figure 6a). The model uses VTC responses during the fixation task as a measure of sensory evidence, and posits that the IPS accumulates evidence from VTC over time and exhibits an activity level that is monotonically related to accumulation time. For example, when VTC responses are small, as is the case for low-contrast stimuli, sensory evidence for stimulus category is weak, leading to long accumulation times (indexed by measurements of reaction time during the experiment), and large IPS responses.

Figure 6

Download asset Open asset

Model of perceptual decision-making in IPS.

(a) *Model architecture.* We implement a model that links the stimulus representation in VTC to a decision-making process occurring in IPS. The model first uses the bottom-up VTC response as a measure of sensory evidence and predicts reaction times in the categorization task. The model then predicts the IPS response as a monotonically increasing function of reaction time. Note that this model does not involve stochasticity in the evidence-accumulation process, and is therefore a simplified version of the classic drift diffusion model (Ratcliff, 1978). (b) *Cross-validation performance.* Same format as Figure 2b (except that reaction times are modeled in the left plot). (c) *Comparison of performance against control models.* The performance of the Drift diffusion model does not degrade substantially if a single threshold is used, thus justifying this simplification. Performance degrades if axis-aligned category vectors are used, supporting the assertion that responses of multiple VTC regions are used by subjects in deciding image category. (d) *Overall model architecture*. This schematic summarizes all components of our computational model (Figures 2a, 5a and 6a). Bottom-up visual information is encoded in the VTC fixation response (green box; Template model), fixation responses are routed to the IPS for evidence accumulation (purple box; Drift diffusion model), and then feedback from the IPS to VTC causes top-down modulation during the categorization and one-back tasks (yellow box; IPS-scaling model).

https://doi.org/10.7554/eLife.22341.011

Our implementation of the Drift diffusion model involves two steps. First, we use VTC responses during the fixation task (reflecting sensory evidence) to predict reaction times measured in the categorization task. The quality of the predictions is quite high (Figure 6b, left). Second, we apply a simple monotonic function to the reaction times measured during the categorization task to predict the level of response in the IPS (see Materials and methods). The rationale is that neural activity in IPS is expected to be sustained over the duration of the decision-making process (Shadlen and Newsome, 2001), and so the total amount of neural activity integrated over time should be larger for longer decisions. Assuming that the BOLD signal reflects convolution of a sluggish hemodynamic response function and fine-scale neural activity dynamics, small differences in the duration of neural activity (for example, between 0 and 2 s) are expected to manifest in differences in BOLD amplitudes (Kayser et al., 2010a) and only minimally in the shapes of BOLD timecourses. The cross-validated predictions of our proposed model explain substantial variance in IPS (Figure 6b, right).

It is possible to offer a psychological explanation of IPS activity as reflecting task difficulty—for example, we can posit that IPS activity is enhanced for low-contrast stimuli because the observer works harder to perceive these stimuli. The value of the model we have proposed is that it provides a quantitative and formal explanation of the computations that underlie ‘difficulty’. According to the model, categorization of low-contrast stimuli is difficult because the IPS computations required to perform the task involve longer accumulation time, and this is reflected in the fact that IPS response magnitudes increase monotonically with reaction time. Thus, our model performs several critical functions: it relates the cognitive task performed by the subject to IPS activity, proposes a computational explanation of task difficulty, and posits that top-down modulation of VTC by IPS is a direct consequence of fulfilling task demands. We have substantiated this hypothesis for the categorization task and suggest that this will serve as a foundation for modeling more complex cognitive tasks such as one-back.

Discussion

In summary, we have measured and modeled how bottom-up and top-down factors shape responses in word- and face-selective cortex. A template operation on low-level visual properties generates a bottom-up stimulus representation, while top-down modulation from the IPS scales this representation in the service of the behavioral goals of the observer. We develop a computational approach that posits explicit models of the information processing performed by a network of interacting sensory and cognitive regions of the brain and validate this model on experimental data. We make publicly available data and open-source software code implementing the model at Kay, 2017 (with a copy archived at https://github.com/elifesciences-publications/vtcipsmodel).

The model we propose is valuable because it integrates and explains a range of different stimulus and task manipulations that affect responses in VWFA, FFA, and IPS. Response properties in these regions can now be interpreted using a series of simple, well-defined computations that can be applied to arbitrary images. However, it is also important to recognize the limitations of the model. First, we have tested the model on only a limited range of stimuli and cognitive tasks. Second, the accuracy with which the model accounts for the data is reasonable but by no means perfect. For instance, the Template model does not capture the step-like response profile of VTC as phase coherence is varied (see Figure 2b), and the accuracy with which the model accounts for a wide range of stimuli that includes faces, animals, and objects, is moderate at best (see Figure 2—figure supplement 1). Third, we have thus far characterized VTC and IPS responses at only a coarse spatiotemporal scale (that is, BOLD responses averaged over specific regions-of-interest). Given these limitations, the present work constitutes a first step towards the goal of developing a comprehensive computational model of human high-level visual cortex. We have provided data and code so that other researchers can build on our approach, for example, by testing the generalizability of the model to other stimuli and tasks, extending and improving the model, and comparing the model against alternative models.

The fact that cognitive factors substantially affect stimulus representation in visual cortex highlights the importance of tightly controlling and manipulating cognitive state when investigating stimulus selectivity. In the present measurements, the most striking example comes from stimulus contrast. When subjects perform the fixation task, the contrast-response function (CRF) in VWFA is monotonically increasing, whereas during the one-back task, the CRF flips in sign and is monotonically decreasing (see Figure 1d). This effect (similar to what is reported in Murray and He, 2006) is puzzling if we interpret the CRFs as indicating sensitivity to the stimulus contrast, but is sensible if we interpret the CRFs as instead reflecting the interaction of stimulus properties and cognitive processes. The influence of cognition on visual responses forces us to reconsider studies that report unexpected tuning properties in VTC and IPS, such as tuning to linguistic properties of text in VWFA (Vinckier et al., 2007) and object selectivity in parietal cortex (Sereno and Maunsell, 1998). In experiments that do not tightly control the cognitive processes executed by the observer, it is impossible to distinguish sensory effects from cognitive effects. Our quantitative model of VTC-IPS interactions provides a principled baseline on which to re-interpret past findings, design follow-up experiments, and guide data analysis.

There is a large body of literature on characterizing and modeling the effect of spatial attention on contrast-response functions (Luck et al., 1997; Boynton, 2009; Reynolds and Heeger, 2009; Itthipuripat et al., 2014). Although several models have been proposed, none of these straightforwardly account for the present set of measurements. The response-gain model (McAdams and Maunsell, 1999) posits that attention causes a multiplicative scaling of contrast-response functions. Our observations are consistent with the general notion of response scaling (see Figure 3), but importantly, we find that the amount of scaling differs for different stimuli. Whereas the response-gain model implies that scaling is constant and therefore contrast-response functions should grow steeper during the stimulus-directed tasks, we find the opposite (see Figure 1d). The contrast-gain model (Reynolds et al., 2000) posits that attention causes a leftward shift of contrast-response functions (as if contrast were increased). This model does not account for our measurements, since the stimulus-directed tasks can generate responses to low-contrast stimuli that are larger than responses to high-contrast stimuli (see Figure 1d). Finally, the additive-shift model (Buracas and Boynton, 2007) posits that attention causes an additive increment to contrast-response functions; we find that our observations are better explained by a scaling, not additive, mechanism (see Figures 3 and 5c). Thus, the effects we report are novel, and previous models of attention cannot explain these effects. Furthermore, by investigating responses to a wide range of stimuli (including manipulations of not only contrast but also phase coherence and stimulus category) and by characterizing the source of attentional signals, our work develops a more comprehensive picture of information processing in the visual system.

There are a number of research questions that remain unresolved. First, our connectivity analysis and modeling of VTC-IPS interactions are based on the correlation of BOLD responses and do not provide information regarding the directionality, or timing, of neural interactions. In other words, our correlational results do not, in and of themselves, prove that the IPS causes VTC modulation; rather, we are imposing an interpretation of the results in the context of a computational model. We note that our interpretation is in line with previous work on perceptual decision-making showing top-down influence (Granger causality) of IPS on visual cortex (Kayser et al., 2010b). Our working hypothesis is that sensory information arrives at VTC (as indexed by fixation responses), these signals are routed to IPS for evidence accumulation, and then feedback from the IPS modulates the VTC response (as indexed by categorization and one-back responses). Temporally resolved measurements of neural activity (for example, EEG, MEG, ECoG) will be necessary to test this hypothesis. Second, the scaling of BOLD response amplitudes by IPS is consistent with at least two potential mechanisms at neural level: the IPS may be inducing a scaling on neural activity in VTC or, alternatively, a sustainment of neural activity in VTC. Some support for the latter comes from a recent study demonstrating that ECoG responses in FFA exhibit sustained activity that is linked to long reaction times in a face gender discrimination task (Ghuman et al., 2014). Finally, IPS is part of larger brain networks involved in attention (Corbetta and Shulman, 2002) and decision-making (Gold and Shadlen, 2007), and identifying the computational roles of other regions in these networks is necessary for a comprehensive understanding of the neural mechanisms of perception.

Materials and methods

Subjects

Eleven subjects participated in this study. Two subjects were excluded due to inability to identify VWFA in one subject and low signal-to-noise ratio in another subject, leaving a total of nine usable subjects (age range 25–32; six males, three females). All subjects were healthy right-handed monolingual native-English speakers, had normal or corrected-to-normal visual acuity, and were naive to the purposes of the experiment. Informed written consent was obtained from all subjects, and the experimental protocol was approved by the Washington University in St. Louis Institutional Review Board. Each subject participated in 1–3 scanning sessions, over the course of which anatomical data (T1-weighted high-resolution anatomical volume, diffusion-weighted MRI data) and functional data (retinotopic mapping, functional localizer, main experiment) were collected.

Visual stimuli

Stimuli were presented using an NEC NP-V260X projector. The projected image was focused onto a backprojection screen and subjects viewed this screen via a mirror mounted on the RF coil. The projector operated at a resolution of 1024 × 768 at 60 Hz, and the viewing distance was 340 cm. A Macintosh laptop controlled stimulus presentation using code based on the Psychophysics Toolbox (Brainard, 1997; Pelli, 1997). Approximate gamma correction was performed by taking the square root of pixel intensity values before stimulus presentation. Behavioral responses were recorded using a button box.

The experiment consisted of 22 types of stimuli. All stimuli were small grayscale images (approximately 2° × 2°) presented at fixation. Each stimulus type consisted of 10 distinct images (for example, 10 different faces for a face stimulus), and a subset of these images were presented on each given trial.

Share this article

Cite this article

VTC responses depend on both stimulus properties and cognitive task.

Model of bottom-up computations in VTC.

Top-down stimulus-specific scaling of VTC representation.

IPS is the source of top-down modulation to VTC.

Model of top-down computations in VTC.

Model of perceptual decision-making in IPS.

Author details

Kendrick N Kay

Contribution

For correspondence

Competing interests

Jason D Yeatman

Contribution

For correspondence

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism