The normalization model predicts responses in the human visual cortex during object-based attention
Abstract
Divisive normalization of the neural responses by the activity of the neighboring neurons has been proposed as a fundamental operation in the nervous system based on its success in predicting neural responses recorded in primate electrophysiology studies. Nevertheless, experimental evidence for the existence of this operation in the human brain is still scant. Here, using functional MRI, we examined the role of normalization across the visual hierarchy in the human visual cortex. Using stimuli from the two categories of human bodies and houses, we presented objects in isolation or in clutter and asked participants to attend or ignore the stimuli. Focusing on the primary visual area V1, the object-selective regions LO and pFs, the body-selective region EBA, and the scene-selective region PPA, we first modeled single-voxel responses using a weighted sum, a weighted average, and a normalization model and demonstrated that although the weighted sum and weighted average models also made acceptable predictions in some conditions, the response to multiple stimuli could generally be better described by a model that takes normalization into account. We then determined the observed effects of attention on cortical responses and demonstrated that these effects were predicted by the normalization model, but not by the weighted sum or the weighted average models. Our results thus provide evidence that the normalization model can predict responses to objects across shifts of visual attention, suggesting the role of normalization as a fundamental operation in the human brain.
Editor's evaluation
This study on object-based attention furthers our understanding of the role of normalization across the visual hierarchy in the human visual cortex. The authors provide solid functional MRI evidence that supports their claims, demonstrating that the normalization model predicts the observed effect when participants selectively attend to one of two stimulus categories. The paper is an important contribution to the fields of perceptual and cognitive neuroscience.
https://doi.org/10.7554/eLife.75726.sa0

Introduction
The brain makes use of fundamental operations to perform neural computations in various modalities and different regions. Divisive normalization has been proposed as one of these fundamental operations. Under this computation, the response of a neuron is determined based on its excitatory input divided by a factor representing the activity of a pool of nearby neurons (Heeger, 1992; Carandini et al., 1997; Carandini and Heeger, 2011). Normalization was first introduced based on responses in the cat primary visual cortex (Heeger, 1992), and evidence of its operation in higher regions of the monkey visual cortex has also been demonstrated both during passive viewing (Bao and Tsao, 2018) and when attention is directed towards a stimulus (Reynolds and Heeger, 2009; Lee and Maunsell, 2010; Ni et al., 2012; Ni and Maunsell, 2019).
Normalization has also been proposed as a critical operation in the human brain based on evidence demonstrating the sublinear addition of responses to multiple stimuli in the visual cortex (Bloem and Ling, 2019). Nevertheless, rather than directly testing the normalization model against multiple-stimulus representations, several previous studies have shown that a weighted average model can account for multiple-stimulus responses in the monkey brain (Zoccolan et al., 2005; Macevoy and Epstein, 2009; Reddy et al., 2009; Kliger and Yovel, 2020). The only exception is a recent electrophysiology study, which showed that in the category-selective regions of the monkey brain, a winner-take-all, but not averaging, rule can explain neural responses in many cases (Bao and Tsao, 2018). Bao and Tsao, 2018 further demonstrated that the normalization model predicts such winner-take-all behavior. It is not clear whether this discrepancy has emerged as a result of the different brain regions explored, or due to differences in the stimuli or the task performed by the participants.
In addition to regional computations for multiple-stimulus representation, the visual cortex relies on top-down mechanisms such as attention to select the most relevant stimulus for detailed processing (Moran and Desimone, 1985; Desimone and Duncan, 1995; Chun et al., 2011; Baluch and Itti, 2011; Noudoost et al., 2010; Maunsell, 2015; Thiele and Bellgrove, 2018; Itthipuripat et al., 2014; Moore and Zirnsak, 2017; Buschman and Kastner, 2015). Attention works through increasing the response gain (Treue and Martínez Trujillo, 1999; McAdams and Maunsell, 1999) or contrast gain (Reynolds et al., 2000; Martínez-Trujillo and Treue, 2002) of the attended stimulus. Previous studies have demonstrated how the normalization computation accounts for these observed effects of attention in the monkey brain. They have suggested that normalization attenuates the neural response in proportion to the activity of the neighboring neuronal pool (Reynolds and Heeger, 2009; Ni et al., 2012; Boynton, 2009; Lee et al., 2009). These studies have focused on space-based (Reynolds and Heeger, 2009; Ni et al., 2012; Lee et al., 2009) or feature-based (Ni and Maunsell, 2019) attention. While it has been suggested that these different forms of attention affect neural responses in similar ways, there exist distinctions in their reported effects, such as different time courses (Hayden and Gallant, 2005) and the extent to which they affect different locations in the visual field (Serences and Boynton, 2007; Womelsdorf et al., 2006), suggesting that there are common sources as well as differences in modulation mechanisms between these forms of attention (Ni and Maunsell, 2019). This leaves open the question of whether normalization can explain the effects of object-based attention.
In the human visual cortex, normalization has been speculated to underlie response modulations in the presence of attention, with evidence provided both by behavioral studies of space-based (Herrmann et al., 2010) and feature-based (Herrmann et al., 2012; Schwedhelm et al., 2016) attention, as well as neuroimaging studies of feature-based attention (Bloem and Ling, 2019). Although previous studies have qualitatively suggested the role of normalization in the human visual cortex (Bloem and Ling, 2019; Kliger and Yovel, 2020; Itthipuripat et al., 2014; Zhang et al., 2016), evidence for directly testing the validity of the normalization model in predicting human cortical responses in a quantitative way remains scarce. A few studies have demonstrated the quantitative advantage of normalization-based models compared to linear models in predicting human fMRI responses using gratings, noise patterns, and single objects (Kay et al., 2013a; Kay et al., 2013b), as well as moving checkerboards (Aqil et al., 2021; Foster and Ling, 2021). However, whether normalization can also be used to predict cortical responses to multiple objects, and if and to what extent it can explain the modulations in response caused by attention to objects in the human brain remain unanswered.
To fill this gap and to explore the discrepancies reported about multiple-stimulus responses, here, we aimed to evaluate the predictions of the normalization model against observed responses to visual objects in several regions of the human brain in the presence and absence of attention. In an fMRI experiment using conditions with isolated and cluttered stimuli and recording the response with or without attention, we provide a comprehensive account of normalization in different regions of the visual cortex, showing its success in adjusting the gain related to each stimulus when it is attended or ignored. We also demonstrate that normalization is closer to average in the absence of attention, as previously reported by several studies (Zoccolan et al., 2005; Macevoy and Epstein, 2009; Kliger and Yovel, 2020), but that the results of the weighted average model and the normalization model diverge to a greater extent in the presence of attention. Our work in the human brain, along with previous studies of normalization in the monkey and human brain, suggests the role of normalization as a canonical computation in the primate brain.
Results
Attention modulates responses to isolated and paired stimuli
In a blocked-design fMRI paradigm, human participants (N = 19) viewed semi-transparent gray-scale stimuli from the two categories of houses and human bodies (Figure 1a). Each experimental run consisted of one-stimulus (isolated) and two-stimulus (paired) blocks, with attention directed either to an object stimulus or to the color of the fixation point. There was an additional fixation color block in each run with no object stimuli, in which the participants were asked to attend to the fixation point color. The experiment, therefore, had a total number of eight conditions (four isolated, three paired, and one fixation conditions, see Figure 1c). In paired blocks, we superimposed the two stimuli to minimize the effect of spatial attention and force participants to use object-based attention (Figure 1b and c). Participants were asked to perform a one-back repetition detection task on the attended object, or a color change detection task on the fixation point (Figure 1b, see Methods for details). Independent localizer runs were used to localize the primary visual cortex (V1), the object-selective regions in the lateral occipital cortex (LO) and posterior fusiform gyrus (pFs), the extrastriate body area (EBA), and the parahippocampal place area (PPA) for each participant (Figure 1d).

Stimuli, paradigm, and regions of interest.
(a) The two stimulus categories (body and house), with the ten exemplars of the body category. (b) Experimental paradigm including the timing of the trials and the inter-stimulus interval. In the example block depicted on the left, both stimulus categories were presented, and the participant was cued to attend to the house category. The two stimuli were superimposed in each trial, and the participant had to respond when the same stimulus from the house category appeared in two successive trials. The color of the fixation point randomly changed in some trials from red to orange, but the participants were asked to ignore the color change. The example block depicted on the right illustrates the condition in which stimuli were ignored and participants were asked to attend to the fixation point color, and respond when they detected a color change. Subjects were highly accurate in performing these tasks (see Figure 1—figure supplement 1). (c) The eight task conditions in each experimental run. For illustration purposes, we have shown the attended category in each block with orange outlines. The outlines were not present in the actual experiment. (d) Regions of interest for an example participant, including the primary visual cortex V1, the object-selective regions LO and pFs, the body-selective region EBA, and the scene-selective region PPA.
Each task condition was named based on the presented stimuli and the target of attention, with B and H denoting the presence of body and house stimuli, respectively, and the superscript denoting the target of attention. Therefore, the seven task conditions include Bat, BatH, BHat, Hat, B, H, and BH. For instance, the Hat condition refers to the isolated house condition with attention directed to house stimuli, and the BH condition refers to the paired condition with attention directed to the fixation point color. Overall, the average accuracy was 86% or higher in all conditions. Averaged across participants, accuracy was 94%, 89%, 86%, 93%, 94%, 96%, 95%, and 96% for the Bat, BatH, BHat, Hat, B, H, and BH conditions and the fixation block with no stimulus, respectively. A one-way ANOVA test across conditions showed a significant effect of condition on accuracy () and reaction time (). As expected, post-hoc t-tests showed that this was due to lower performance in the BatH and BHat conditions (see Figure 1—figure supplement 1). There was no significant difference in performance between any other conditions ().
To examine the cortical response in different task conditions, we fit a general linear model and estimated the regression coefficients for each voxel in each condition. Figure 2 illustrates the average voxel coefficients for different conditions in the five regions of interest (ROIs), including V1, LO, pFs, EBA, and PPA. Note that we have not included the responses related to the fixation block with no stimulus since this condition was only used to select the voxels that were responsive to the presented stimuli in each ROI (see Methods). We observed that the average voxel coefficients related to the four conditions in which attention was directed to the body or the house stimuli (the first four conditions, Bat, BatH, BHat, Hat) were generally higher than the response related to the last three conditions (B, H, and BH conditions) in which the body and house stimuli were unattended (). This is in agreement with previous research indicating that attention to objects increases their cortical responses (Reddy et al., 2009; Roelfsema et al., 1998; O’Craven et al., 1999).

Average fMRI regression coefficients and voxel preference for the two categories in all regions of interest (ROIs).
(a–e) Average fMRI regression coefficients for each condition are illustrated in the five ROIs. Each condition’s label denotes the presented stimuli and the target of attention, with B and H, respectively, denoting the presence of body and house stimuli and the superscript denoting the target of attention. Therefore, the seven task conditions include Bat, BatH, BHat, Hat, B, H, and BH. For instance, the Hat condition refers to the isolated house condition with attention directed to houses, and the BH condition refers to the paired condition with attention directed to the fixation point color. Error bars represent standard errors of the mean for each condition, calculated across participants after removing the overall between-subject variance. N = 19 human participants. (f) The ratio of voxels preferring bodies and houses in each ROI. Both the regression coefficients and the voxel preference ratios were consistent across odd and even runs (see Figure 2—figure supplement 1 and Figure 2—figure supplement 2).
Looking more closely at the results in the regions EBA and PPA, which have strong preferences for body and house stimuli, respectively, it seems that the effect of attention interacts with the regions’ preference. For instance, in the body-selective region EBA, the response to attended body stimuli in isolation is similar to the response to attended body stimuli paired with unattended house stimuli (compare Bat and BatH bars). On the other hand, the response to attended house stimuli in the isolated condition is significantly less than the response to attended house stimuli paired with unattended body stimuli. We can observe similar results in PPA, but not in V1 or the object-selective regions LO and pFs. Note, however, that the latter three regions do not have a strong preference for one stimulus versus the other. Therefore, in order to examine the interaction between attention and preference more closely, we determined preferences at the voxel level in all ROIs.
We defined the preferred (P) and null (N) stimulus categories for each voxel in each ROI according to the voxel’s response to the isolated body and isolated house conditions. Figure 2f shows the percentage of voxels in each region that were selective to bodies and houses, averaged across participants. As illustrated in the figure, in the object-selective regions LO and pFs, almost half of the voxels were selective to each category, while in the EBA and PPA regions, the general preference of the region prevailed (even though these regions were selected based on their preference, noise in the fMRI data and variations due to imperfect registration led to some voxels showing different preferences in the main session compared to the localizer session; Peelen and Downing, 2005).
After determining voxel preferences, we rearranged the seven task conditions according to each voxel’s preference. The conditions are hereafter referred to as Pat, PatN, PNat, Nat, P, PN, and N, with P and N denoting the presence of the preferred and null stimuli, respectively, and the superscript denoting the attended category. Mean voxel responses in the five ROIs for all task conditions are illustrated by navy lines in Figure 3a–e. Note that although the seven conditions constitute a discrete and not a continuous variable, we have connected the responses in attended conditions (in which body or house stimuli were attended) and unattended conditions (in which body and house were ignored and the fixation point color was attended) separately. This was done for visual purposes and ease of understanding.

Divisive normalization explains voxel responses in different stimulus conditions.
(a–e) Average fMRI responses and model predictions in the five regions of interest. Navy lines represent average responses. Light blue, gray, and orange lines show the predictions of the weighted sum, the weighted average, and the normalization models, respectively. The x-axis labels represent the 7 task conditions, Pat, PatN, PNat, Nat, P, PN, N, with P and N denoting the presence of the preferred and null stimuli and the superscript denoting the attended category. For instance, P refers to the condition in which the unattended preferred stimulus was presented in isolation, and PatN refers to the paired condition with the attended preferred and unattended null stimuli. Error bars represent standard errors of the mean for each condition, calculated across participants after removing the overall between-subject variance. N = 19 human participants. (f) Mean explained variance, averaged over voxels in each region of interest for the 5 conditions predicted by the three models. Light blue, gray, and orange bars show the average variance explained by the weighted sum, the weighted average, and normalization models, respectively. Error bars represent the standard errors of the mean. N = 19 human participants. Dashed lines above each set of bars indicate the noise ceiling in each ROI, with the light blue shaded area representing the standard errors of the mean calculated across participants (see Figure 3—figure supplement 1 for an example illustration of how the goodness of fit was calculated for each voxel). As observed in the figure, the normalization model was a better fit for the data compared to the weighted sum (ps < 0.02) and the weighted average (ps < 0.0001) models. Simulation results demonstrate that this superiority is not related to the higher number of parameters or the nonlinearity of the normalization model (see Figure 3—figure supplement 2).
We observed that the mean voxel response was generally higher when each stimulus was attended compared to the condition in which it was ignored. For instance, the response in the Pat condition (in which the isolated preferred stimulus was attended) was higher than in the P condition (where the isolated preferred stimulus was ignored) in LO, pFs, and PPA (), marginally higher in EBA (), and not significantly higher in V1 (). Similarly, comparing the N and Nat conditions in each ROI, we observed an increase in response caused by attention in all ROIs () except for V1 (). A similar trend of response enhancement due to attention could also be observed in the paired conditions: attending to either stimulus increased the response in all ROIs () except for V1 (). In all cases, the effect of attention was absent or only marginally significant in V1, which is not surprising since attentional effects are much weaker (McAdams and Maunsell, 1999) or even absent (Luck et al., 1997) in V1 compared to the higher-level regions of the occipito-temporal cortex. Next, we asked whether we could predict these response variations and attentional modulations caused by the change in the presented stimuli and the target of attention using three different models.
Divisive normalization explains voxel responses in different stimulus conditions
We used the three models of weighted sum, weighted average, and normalization to predict voxel responses in different task conditions. Based on the weighted sum model, the response to multiple stimuli is determined by the sum of the responses to each individual stimulus presented in isolation, and attention to each stimulus increases the part of the response associated with the attended stimulus. For instance, in the presence of a null and a preferred stimulus with attention to the preferred stimulus, the response can be determined by R_PatN = β R_P + R_N, with R_PatN, R_P, and R_N denoting the response elicited by both stimuli with attention directed to the preferred stimulus, the response to the isolated preferred stimulus, and the response to the isolated null stimulus, respectively. β is the attention-related parameter.
According to the weighted average model, the response to multiple stimuli is determined by the average of isolated-stimulus responses, weighted by the parameter related to attention. Therefore, with an attended preferred and an ignored null stimulus, the response can be written as R_PatN = (β R_P + R_N) / (β + 1).
Finally, based on the normalization model, the response to a stimulus is determined by the excitation due to that stimulus and the suppression due to the neighboring neuronal pool. Therefore, the response to an attended preferred and an ignored null stimulus is determined by R_PatN = (β c_P L_P + c_N L_N) / (β c_P + c_N + σ), where L_P and L_N respectively denote the excitation caused by the preferred and the null stimulus, and σ represents the semi-saturation constant. c_P and c_N are the respective contrasts of the preferred and null stimuli. Zero contrast for a stimulus denotes that the stimulus is not present in the visual field. In our experiment, we set contrast values to one when a stimulus was presented, and to zero when the stimulus was not presented (see Methods for detailed descriptions of models).
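For concreteness, the three response rules can be sketched as simple functions. This is a minimal illustration assuming the parameterizations described above; the exact equations and fitting procedure are given in Methods.

```python
def weighted_sum(r_p, r_n, beta):
    # Attention (beta) scales the part of the response driven by the
    # attended preferred stimulus; the null response adds linearly.
    return beta * r_p + r_n

def weighted_average(r_p, r_n, beta):
    # Attention biases the average toward the attended stimulus;
    # the weights are normalized so that they sum to one.
    return (beta * r_p + r_n) / (beta + 1)

def normalization(L_p, L_n, beta, sigma, c_p=1.0, c_n=1.0):
    # Attention-weighted excitation divided by the summed stimulus drive
    # plus the semi-saturation constant sigma. Setting a contrast to zero
    # removes that stimulus from both numerator and denominator.
    return (beta * c_p * L_p + c_n * L_n) / (beta * c_p + c_n + sigma)
```

Note that with beta = 1 and sigma = 0, the normalization prediction reduces to the plain average of the two excitatory drives, and with c_n = 0 it reduces to the isolated-stimulus response, which is why the model can interpolate between averaging and winner-take-all behavior.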
Although many studies have demonstrated that responses to multiple stimuli are added sublinearly in the visual cortex (Heeger, 1992; Bloem and Ling, 2019; Reddy et al., 2009; Aqil et al., 2021), it has been suggested that for weak stimuli, response summation can approach a linear or even a supralinear regime (Rubin et al., 2015; Heuer and Britten, 2002). Since the stimuli we used in this experiment were presented in a semi-transparent form and were therefore not in full contrast, we found it probable that the response might be closer to a linear summation regime in some cases. We thus used the weighted sum model to examine whether the response approaches linear summation in any region.
To compare the three models in their ability to predict the data, we split the fMRI data into two halves (odd and even runs) and estimated the model parameters separately for each voxel of each participant twice: once using the first half of the data, and a second time using the second half of the data. All comparisons of data with model predictions were made using the left-out half of the data in each case. All model results illustrate the average of these two cross-validated predictions. Note that this independent prediction is critical since the numbers of parameters in the three models are different. Possible over-fitting in the normalization model with more parameters will not affect the independent predictions (Kay et al., 2013b). The predictions of the three models for the five modeled task conditions are illustrated in Figure 3a–e (the two isolated ignored conditions P and N were excluded as they were used by the weighted sum and the weighted average models to predict responses in the remaining five conditions, see Methods).
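The split-half cross-validation procedure can be sketched as follows. This is a simplified illustration for the weighted-sum rule only, using a closed-form least-squares fit of the single attention parameter; the actual fitting of all three models is described in Methods.

```python
import numpy as np

def fit_beta_weighted_sum(r_p, r_n, paired):
    # Closed-form least-squares estimate of the attention parameter beta
    # for the weighted-sum rule: paired ~ beta * r_p + r_n.
    resid = paired - r_n
    return np.sum(r_p * resid) / np.sum(r_p ** 2)

def cross_validated_prediction(r_p, r_n, paired_odd, paired_even):
    # Fit beta on one half of the runs, predict the left-out half,
    # swap the halves, and average the two held-out predictions.
    beta_from_odd = fit_beta_weighted_sum(r_p, r_n, paired_odd)
    beta_from_even = fit_beta_weighted_sum(r_p, r_n, paired_even)
    pred_for_even = beta_from_odd * r_p + r_n   # tested against even runs
    pred_for_odd = beta_from_even * r_p + r_n   # tested against odd runs
    return 0.5 * (pred_for_odd + pred_for_even)
```

Because each prediction is evaluated on data that played no role in estimating the parameters, a model with more free parameters gains no automatic advantage.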
As evident in the figure, the predictions of the normalization model (in orange) are generally better than the predictions of the weighted sum and the weighted average models (light blue and gray, respectively) in all regions. To quantify this observation, we calculated the goodness of fit for each voxel by taking the square of the correlation coefficient between the predicted model response and the respective fMRI responses across the five modeled conditions (Figure 3—figure supplement 1). We also calculated the noise ceiling in each region separately as the r-squared of the correlation between the odd and even halves of the data. Given that the correlation between the model and the data cannot exceed the reliability of the data (as calculated by the correlation between the data from odd and even runs), the r-squared can also not exceed the squared split-half reliability. The noise ceiling (squared split-half reliability), therefore, determines the highest possible goodness of fit a model can reach. The results are illustrated in Figure 3f.
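The two quantities described above can be computed as follows (a sketch of the goodness-of-fit and noise-ceiling calculations, assuming Pearson correlation across the five modeled conditions):

```python
import numpy as np

def goodness_of_fit(pred, data):
    # Squared Pearson correlation between model predictions and voxel
    # responses across the modeled conditions.
    r = np.corrcoef(pred, data)[0, 1]
    return r ** 2

def noise_ceiling(data_odd, data_even):
    # Squared split-half reliability: the r-squared of the correlation
    # between responses estimated from odd and even runs. This bounds
    # the goodness of fit any model can achieve on these data.
    r = np.corrcoef(data_odd, data_even)[0, 1]
    return r ** 2
```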
We first compared the goodness of fit of the three models across the five ROIs using a repeated measures ANOVA. The results showed a significant main effect of model () and ROI (), and a significant model by ROI interaction (). On closer inspection, the normalization model was a better fit to the data than both the weighted sum () and the weighted average () models in all ROIs. Since the normalization model had more parameters, we also used the AIC measure to correct for the difference in the number of parameters. The normalization model was a better fit according to the AIC measure as well (see Supplementary file 2). It is noteworthy that while the weighted average model performed better than the weighted sum model in LO and EBA (), it was not significantly better in pFs and PPA (), and worse than the weighted sum model in V1 ().
We then calculated the normalization model’s r-squared difference from the noise ceiling (NRD) for each ROI (Equation 7). NRD measures the model’s ability to account for the explainable variation in the data: the lower the difference between the noise ceiling and a model’s goodness of fit, the more successful that model is in predicting the data. We ran a one-way ANOVA to test for the effect of ROI on NRD and observed that this measure was not significantly different across ROIs (), demonstrating that the normalization model was equally successful across ROIs in predicting the explainable variation in the data.
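Under these definitions, the NRD measure is simply the gap between the ceiling and the model fit. A sketch, assuming both quantities are expressed as r-squared values:

```python
def noise_ceiling_difference(ceiling_r2, model_r2):
    # NRD: how far the model's r-squared falls short of the noise ceiling.
    # Smaller values mean the model captures more of the explainable variance.
    return ceiling_r2 - model_r2
```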
Interestingly, focusing only on the paired condition in which neither of the stimuli was attended (the PN condition), the results of the weighted average model were closer to normalization (the gray and orange isolated data points in subplots a–e of Figure 3 are similarly close to the navy data point in some regions). For this condition, the predictions of the normalization model were significantly closer to the data than the predictions of the weighted average model in V1, pFs, and PPA (), but not significantly closer in LO and EBA (). These results are in agreement with previous studies suggesting that the weighted average model provides good predictions of neural and voxel responses in the absence of attention (Zoccolan et al., 2005; Macevoy and Epstein, 2009; Kliger and Yovel, 2020). However, when considering all the attended and unattended conditions, our results show that the normalization model is a generally better fit across all ROIs.
To ensure that the superiority of the normalization model over the weighted sum and weighted average models was not caused by the normalization model’s nonlinearity or its higher number of parameters, we ran simulations of three neural populations. Neurons in each population calculated responses to multiple stimuli and attended stimuli by a summing, an averaging, or a normalizing rule, respectively (see Methods). We then used the three models to predict the population responses. Our simulation results demonstrate that despite its higher number of parameters, the normalization model is only a better fit for the population of normalizing neurons, and not for the summing or averaging neurons, as illustrated in Figure 3—figure supplement 2. These results confirm that the better fits of the normalization model cannot be attributed to the model’s nonlinearity or its higher number of parameters.
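The control simulation can be sketched as follows. This is a toy version with made-up drive distributions and parameter values; the full simulation is described in Methods.

```python
import numpy as np

def simulate_population(rule, n_neurons=100, beta=2.0, sigma=0.1, seed=0):
    # Generate random single-stimulus drives for the preferred (L_p) and
    # null (L_n) stimulus, then combine them for the paired condition with
    # attention on the preferred stimulus, using one of three rules.
    rng = np.random.default_rng(seed)
    L_p = rng.uniform(1.0, 2.0, n_neurons)
    L_n = rng.uniform(0.2, 1.0, n_neurons)
    drive = beta * L_p + L_n
    if rule == "sum":
        return drive
    if rule == "average":
        return drive / (beta + 1)
    if rule == "normalization":
        return drive / (beta + 1 + sigma)
    raise ValueError(f"unknown rule: {rule}")
```

Fitting all three models to each simulated population, and checking which model wins on held-out data, then tests whether the extra parameters of the normalization model alone could produce spurious wins.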
Normalization accounts for the change in response with the shift of attention
Next, comparing the responses in different conditions, we observed two features in the data. First, for the paired conditions, shifting attention from the preferred to the null stimulus caused a reduction in voxel responses. We calculated this reduction in response for each voxel as the difference between the PatN and PNat responses (Figure 4a, top panel). This response change was significantly greater than zero in all ROIs () except V1 (). Because the same stimuli were presented in both conditions but the attentional target changed from one category to the other, this change in response could only be related to the shift in attention and the stimulus preference of the voxels.
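This per-voxel quantity, and the model-versus-data distance entered into the analyses below, can be sketched as follows (assuming the shift-of-attention subtraction described in the text):

```python
import numpy as np

def response_change(r_pat_n, r_p_nat):
    # Reduction in the paired response when attention shifts from the
    # preferred stimulus (PatN condition) to the null stimulus (PNat).
    return np.asarray(r_pat_n) - np.asarray(r_p_nat)

def model_data_difference(observed_change, predicted_change):
    # Absolute distance between the observed and model-predicted response
    # change; these difference values enter the repeated-measures ANOVA.
    return np.abs(np.asarray(observed_change) - np.asarray(predicted_change))
```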

Normalization accounts for the observed effects of attention.
(a) Top: Change in BOLD response when attention shifts from the preferred to the null stimulus in the presence of two stimuli, illustrated here for extrastriate body area (EBA). Bottom: The observed response change and the corresponding amount predicted by different models in different regions, calculated as illustrated in plot A. Error bars represent the standard errors of the mean. N = 19 human participants. (b) Top: The observed asymmetry in attentional modulation for attending to the preferred versus the null stimulus, depicted for EBA. Bottom: The observed and predicted asymmetries in attentional modulation in different regions, calculated as illustrated in plot B. Error bars represent the standard errors of the mean. N = 19 human participants.
We then calculated the response change predicted by the three models to investigate model results in more detail. As illustrated in the bottom panel of Figure 4a, the orange bars depicting the predictions of the normalization model were very close to the navy bars depicting the observations in all ROIs, while the predictions of the weighted sum and the weighted average models (light blue and gray bars, respectively) were significantly different from the data in most regions.
To quantify this observation and to compare how closely the predictions of the three models followed the response change in the data, we calculated the difference between the response change observed in the data and the response change predicted by each model. Then, we ran a repeated measures ANOVA with within-subject factors of the model and ROI on the obtained difference values. The results demonstrated a significant effect of model (), a significant effect of ROI (), and a significant model by ROI interaction (). Post-hoc t-tests showed that the predictions of the normalization model were closer to the response change observed in the data in all ROIs () except in V1, where the predictions of the weighted sum and the weighted average models were closer to the data ().
Asymmetry in attentional modulation is explained by the normalization model
The second feature we observed was that the effect of the unattended stimulus on the response depended on voxel selectivity for that stimulus, with the unattended preferred stimulus having larger effects on the response than the unattended null stimulus. Attending to the preferred stimulus in the presence of the null stimulus caused the response to approach the response elicited when attending to the isolated preferred stimulus. Therefore, attention effectively removed the effect of the null stimulus. However, attending to the null stimulus in the presence of the preferred stimulus did not eliminate the effect of the preferred stimulus and yielded a response well above the response elicited by attending to the isolated null stimulus. While this is the first time such asymmetry has been reported in human fMRI studies, these results are in agreement with previous monkey electrophysiology studies, showing the existence of an asymmetry in attentional modulation for attention to the preferred versus the null stimulus (Lee and Maunsell, 2010; Ni et al., 2012).
To quantify the observed asymmetry, we calculated an asymmetry index for each voxel by , which is illustrated in the top panel of Figure 4b. This index was significantly greater than zero in all regions ().
Here, too, the normalization model was better at predicting the observed asymmetry in the data. The bottom panel of Figure 4b illustrates the asymmetry indices for the data and the three models in all regions. We calculated the difference between the asymmetry index observed in the data and the index predicted by each model and performed a repeated measures ANOVA on these difference values to compare how closely the three models predicted the asymmetry effect across ROIs. We observed a significant effect of model (), a significant effect of ROI (), and a significant model by ROI interaction (). The prediction of the normalization model was closer to the data in all regions () except for PPA, where the prediction of the weighted sum model was closer to the observed asymmetry than that of the normalization model ().
Other variants of the weighted average model
The weighted average model we used in previous sections had equal weights for the preferred and null stimuli, with attention biasing the attended preferred or null stimulus by the same amount (the weighted average equal weights, EW, model). However, different stimuli might carry different weights in the paired response depending on the neurons’ preference for the stimuli. In addition, attention may bias the preferred and null stimuli differently. Therefore, to examine the effect of unequal weights and unequal attention parameters on the fit of the weighted average model, we tested two additional variants of this model.
To examine how unequal weights affect the fit of the weighted average model, we tested the weighted average unequal weights (UW) model. Comparing the fit of this model with that of the weighted average EW model showed that the UW variant fit significantly better in all regions () except in LO, where it fit marginally better (). In the next step, to examine the combined effect of unequal weights and unequal attention parameters, we tested the weighted average unequal weights and unequal betas (UWUB) variant. This model had unequal weights and unequal attention parameters for the P and N stimuli, and no constraint was placed on the sum of the weights. This model was thus effectively a generalization of the weighted sum and the weighted average models with four parameters. It was a better fit than the weighted average EW model in all regions () except in EBA ().
We next compared the goodness of fit of all weighted average variants with that of the normalization model using an ANOVA with factors of model and ROI, as illustrated in Figure 5a. There was a significant effect of model (), a significant effect of ROI (), and a significant model by ROI interaction (). Post-hoc t-tests showed that these weighted average variants were still significantly worse fits to the data than the normalization model in all regions () except for EBA, where the normalization model was marginally better than the weighted average UWUB ().

Comparison between the weighted average model variants and the normalization model predictions.
(a) Comparison of the goodness of fit for weighted average variants and the normalization model. (b) The goodness of fit of the normalization model compared to the weighted average unequal weights and unequal betas (UWUB) saturation variant. (c) The asymmetry index was calculated for the data, compared to the predictions of the normalization model and the weighted average UWUB saturation model. Error bars represent the standard errors of the mean, calculated across N = 19 human participants.
Next, to examine whether the observed asymmetry was caused by response saturation, we tested a nonlinear variant of the weighted average model with saturation (the weighted average UWUB saturation model). This model’s goodness of fit, as well as its predictions of asymmetry, are illustrated in Figure 5b and c. As illustrated in the figure, the saturation model’s predicted asymmetry was closer to the data than the normalization model’s prediction only in EBA (). In other regions, the normalization model’s prediction of asymmetry was either significantly closer to the data (in V1, LO, and pFs, ) or not significantly different from the saturation model’s (in PPA, ). After running an ANOVA to compare the fits of the normalization model and the weighted average UWUB saturation model across ROIs, we observed a significant effect of model (), a significant effect of ROI (), and a significant model by ROI interaction (). Post-hoc t-tests showed that the normalization model was a significantly better fit than the saturation model in all ROIs ().
Discussion
Here, using single-voxel modeling, we examined the validity of the normalization model and demonstrated its superiority to the weighted sum and the weighted average models in predicting cortical responses to isolated and cluttered object stimuli. We also showed the success of the normalization model in predicting the observed effects of object-based attention, further suggesting it as a fundamental operation in the human brain.
While several electrophysiology studies have examined normalization in the monkey brain (Bao and Tsao, 2018; Ni et al., 2012; Ni and Maunsell, 2017; Ni and Maunsell, 2019), and although normalization has also been proposed to operate in the human brain (Bloem and Ling, 2019; Kay et al., 2013a), evidence for its validity in the human brain, particularly in the presence of attention, is still scarce. Expanding on the results of previous studies, showing the role of normalization in the human visual cortex for simple stimuli (Kay et al., 2013a; Kay et al., 2013b; Aqil et al., 2021), our work offers evidence for the role of normalization in multiple-object representation and object-based attention.
After comparing model predictions with the data, we investigated the effect of attention on multiple-object representation. Defining preferred and null stimuli for each voxel, we showed that when attention shifted from the preferred to the null stimulus, the response decreased significantly in multiple regions of interest in the occipito-temporal cortex but not in the primary visual area. Furthermore, this response reduction increased as we moved up the visual hierarchy, consistent with speculations of greater effects of top-down attention on higher regions of the visual cortex (Cook and Maunsell, 2002). Although this response reduction is also predicted by the biased competition model (Desimone and Duncan, 1995), our results showed that the predictions of the weighted average model were significantly smaller than the response reduction observed in the data in all ROIs except in V1. In contrast, the normalization model predicted the response reduction more accurately in all regions except in V1, where no significant response reduction was observed. Previous reports of the effects of attention in V1 have also been inconsistent, with some studies reporting attentional effects in this region (Somers et al., 1999; Gandhi et al., 1999), while others report little (McAdams and Maunsell, 1999) to no observed effect of attention (Luck et al., 1997). As suggested by Heeger and Ress, 2002, this might be due to differences in task design and experimental method.
Moreover, our results indicated an asymmetry in attentional modulation when attending to the preferred versus the null stimulus. We demonstrated that while attention to the preferred stimulus almost eliminates the effect of the ignored null stimulus, attention to the null stimulus does not remove the effect of the preferred stimulus. Unlike the response change with the shift of attention, which increases across the hierarchy, the asymmetry measured by our index remains approximately the same in all regions, suggesting that it depends not on the top-down attentional signal but on the normalization computation performed within each region. This feature was also predicted by the normalization model but not by the weighted sum or the weighted average models.
Normalization has been reported to cause neural populations to operate in the averaging or winner-take-all regimes depending on stimulus contrast (Busse et al., 2009). Here, we showed that in the presence of attention, responses can deviate from the averaging regime even without a change in contrast. We observed winner-take-all behavior when the preferred stimulus was attended, since its higher response, along with its increase in gain due to attention, made it a much stronger input than the ignored null stimulus. On the other hand, when the null stimulus was attended, the response was closer to an average than to a max-pooling response. This result explains why several previous studies in the object-selective regions indicated averaging as the rule for multiple-stimulus representation (Zoccolan et al., 2005; Macevoy and Epstein, 2009), while studies in regions with strong preferences for a particular category reported a winner-take-all mechanism (Bao and Tsao, 2018). We, therefore, extend previous reports of multiple-stimulus representation, which related the response to stimulus contrast (Busse et al., 2009) and neural selectivity (Bao and Tsao, 2018), by demonstrating that a combination of bottom-up and top-down signals acts to yield a response that can be the average of the isolated responses, a winner-take-all response, or somewhere between the two. We also demonstrate for the first time that the normalization model is superior to the weighted average model, which has often been used in lieu of the normalization model (Zoccolan et al., 2005; Macevoy and Epstein, 2009; Kliger and Yovel, 2020), in its ability to account for fMRI responses in the presence of attention. Testing variants of the weighted average model with unequal weights and unequal attention parameters for the preferred and null stimuli, we demonstrate that the normalization model fits the data better than all of these variants.
Stimulus contrast has also been shown to play a crucial role in how single-stimulus responses combine into the multiple-stimulus response. While responses to strong high-contrast stimuli add sublinearly to yield the multiple-stimulus response, as predicted by the normalization model and the weighted average model, the sublinearity decreases at lower contrasts and even turns into linearity and supralinearity for weak stimuli (Heuer and Britten, 2002; Rubin et al., 2015). Here, since the stimuli we used were not presented at full contrast, we also tested the weighted sum model to examine whether responses approach linearity in any region. Our results demonstrate that while the weighted average model generally performs better than the weighted sum model in the higher-level occipito-temporal cortex, the weighted sum model provides better predictions in V1. These results suggest stronger sublinearity in higher regions of the visual cortex compared to V1, in agreement with previous reports (Kay et al., 2013b). This observation might be related to the higher sensitivity of V1 neurons to contrast (Goodyear and Menon, 1998), causing a larger decrease in V1 responses to low-contrast stimuli. This, in turn, might make the low-contrast stimulus effectively weaker for V1 neurons, moving responses toward a lower level of sublinearity (Sceniak et al., 1999).
Attention to a stimulus has been suggested in the literature to be similar to an increase in the contrast of the attended stimulus (Ni et al., 2012), which is manifested in the similar effects of attention and contrast in the normalization equation. In this study, we presented the stimuli with a constant contrast but changed the number of stimuli and their attentional state to determine whether the normalization model could explain the effects of object-based attention in the human visual cortex, which has not been previously studied. We acknowledge that to fully ascertain the role of normalization in the human brain, we have to measure the contrast response function in each voxel to truly constrain the models and have conditions with varying levels of contrasts across attentional manipulations. Note that including variations of both attentional state and contrast is not trivial and is not possible with a single-session fMRI experiment. Our results remain suggestive of the role of normalization until these conditions are tested in future multi-session experiments.
Here, we compared the nonlinear normalization model with two linear models with fewer free parameters. To ensure that the difference in the number of free parameters did not drive the results, we used cross-validation and the AIC measure to compare model predictions with the data. If the success of the normalization model were merely due to its higher number of free parameters, its predictions for left-out data would suffer; instead, the normalization model was also successful in predicting the left-out part of the data. In addition, we tested a nonlinear model variant with five free parameters, which was still a worse fit than the normalization model. Finally, we simulated three different neural populations, with neurons in each population following either a summing, averaging, or normalization rule in their responses to multiple and attended stimuli. Simulation results demonstrated that the normalization model was a better fit only for the normalizing population, confirming that the success of the normalization model is not due to its nonlinearity or its higher number of parameters but rather to its being a closer approximation of the computation performed at the neural level for object representation.
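The logic of such a population simulation can be sketched as follows. This is a toy version of ours, not the code used in the study: each simulated neuron combines the excitatory drives of the two stimuli by summing, averaging, or normalization, and a "voxel" is the population mean. With positive drives, the normalized paired response always lies between the averaging and summing predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def voxel_response(rule, drive_p, drive_n, sigma=0.5):
    """Mean population ('voxel') response to a stimulus pair.

    drive_p, drive_n: per-neuron excitatory drives of the preferred and
    null stimuli. Isolated responses are drive / (1 + sigma), as in a
    normalization formulation with unit stimulus contrast.
    """
    iso_p = drive_p / (1 + sigma)
    iso_n = drive_n / (1 + sigma)
    if rule == 'sum':                      # summing population
        paired = iso_p + iso_n
    elif rule == 'average':                # averaging population
        paired = (iso_p + iso_n) / 2
    elif rule == 'normalization':          # normalizing population
        paired = (drive_p + drive_n) / (2 + sigma)
    else:
        raise ValueError(rule)
    return paired.mean()

# A toy population of 200 neurons preferring stimulus P
drive_p = rng.uniform(1.0, 3.0, 200)
drive_n = rng.uniform(0.1, 1.0, 200)
```

For this population, `voxel_response('normalization', drive_p, drive_n)` falls between the 'average' and 'sum' predictions, which is why fitting all three models to the same simulated voxels can reveal which rule generated the responses.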
It is noteworthy that here we are examining BOLD responses, and we are aware of the limitations of the fMRI technique, as the BOLD response is an indirect measure of the activity of neural populations. While an increase in the BOLD signal could reflect an increase in the firing rates of the local neuronal population (Logothetis et al., 2001), it could also reflect subthreshold activity resulting from feedback from downstream regions of the visual cortex (Heeger and Ress, 2002). The observed effects, therefore, may arise from local population responses or may be influenced by feedback from downstream regions. Also, since the measured BOLD signal reflects the average activity of the local population, and we do not have access to single-unit responses, some effects may change in the averaging process. Nevertheless, our simulation results show that the effects of the normalization computation are preserved even after averaging. We should keep in mind, though, that these are only simulations and are not based on data directly measured from neurons. Future experiments with intracranial recordings from neurons in the human visual cortex would be invaluable in validating our results.
Another limitation in interpreting the results relates to possible saturation of the BOLD response, which could in principle explain the observed asymmetry in attentional modulation. Since the asymmetry in attentional modulation has also been reported at the neural level (Ni et al., 2012), this effect is unlikely to be caused exclusively by saturation in the BOLD signal. It is noteworthy, however, that saturation is a characteristic of cortical responses even at the neural level, and whether the effect at the neural level is caused by response saturation or by the normalization computation cannot be distinguished based on our current knowledge. Nevertheless, we tested a variant of the weighted average model with an extra saturation parameter. Although this model could partially predict the observed asymmetry, its predictions were worse than the normalization model’s, and it was an overall worse fit to the data. The normalization model therefore provides a more parsimonious account of the data. Nevertheless, we have only tested one saturation model. The exact nonlinearities governing the transformation of neural population responses into the BOLD response are not fully mapped out yet, especially when multiple overlapping stimuli are presented in the visual scene, and the advantage of the normalization model could be at least partly related to these nonlinearities. Modeling such nonlinearities requires experiments that record fMRI and neural data simultaneously, and the validity of our conclusions about the superiority of the normalization model should be reevaluated once such data become available.
In sum, our results indicate that the normalization model predicts responses at the voxel level beyond the primary visual cortex and across the visual hierarchy, especially in higher-level regions of the human occipito-temporal cortex, with and without attention and in conditions with isolated or cluttered stimuli. We, therefore, provide evidence suggesting divisive normalization as a canonical computation operating in the human brain during object-based attention.
Methods
Participants
A total of 21 healthy right-handed participants (10 females, aged 20–40 years, all with normal or corrected-to-normal vision) took part in the experiment. All participants gave written consent prior to their participation and received payment for taking part in the experiment. Imaging was performed according to safety guidelines approved by the ethics committee of the Institute for Research in Fundamental Sciences (IPM). Data from two participants were removed from the analysis because of excessive head motion (more than 3 mm throughout the session). This exclusion criterion was established before running the experiment. Based on previous studies with sample sizes of 5–10 (Bloem and Ling, 2019; Reddy et al., 2009; Kay et al., 2013a; Kay et al., 2013b), we are confident that this sample size provides enough power for comparing different model predictions of fMRI responses.
Stimuli and experimental procedure
Stimuli were from the two categories of human bodies and houses similar to the ones used in previous studies (Vaziri-Pashkam and Xu, 2017; Vaziri-Pashkam and Xu, 2019; Xu and Vaziri-Pashkam, 2019). Each category consisted of ten unique exemplars in gray-scale format (Figure 1a). These exemplars differed in identity, pose (for bodies), and viewing angle (for houses). Stimuli were fitted into a transparent square subtending 10.2° of visual angle and placed on a gray background. A central red fixation point subtending 0.45° of visual angle was present throughout the run. Stimuli from each category were presented in semi-transparent form, in isolation, or paired with stimuli from the other category.
In a blocked-design paradigm, participants were instructed by a word cue at the beginning of each block to attend to bodies, houses, or the color of the fixation point. Therefore, the stimuli from each category were either attended (when the category was cued) or ignored (when the fixation point color was cued), in isolation or cluttered by stimuli from the other category. The main experiment thus consisted of seven block types covering all possible combinations of stimulus category and attention: Attend isolated bodies, Attend cluttered bodies, Attend isolated houses, Attend cluttered houses, Ignore isolated bodies, Ignore isolated houses, Ignore cluttered bodies and houses (Figure 1c). In blocks with attention to bodies or houses, participants performed a one-back repetition detection task on the attended stimuli, pressing a button when the exact same stimulus appeared in two consecutive trials. In blocks with attention directed toward the fixation point color, participants responded when the color of the fixation point changed from red to orange; these blocks served as conditions in which the visual stimuli (bodies and houses) were ignored. Target repetitions and fixation point color changes occurred 2–3 times at random in each block. There was also an additional fixation color block in each run, with a red fixation point presented in the middle of the gray screen in the absence of stimuli from either category. The participants’ task in this block was to detect a fixation point color change; the fixation point color changed to orange two or three times during the block. We used a contrast between the BH condition (with the same task but with both body and house stimuli present) and this fixation condition to select the voxels in each ROI that were responsive to the presented stimuli. This was especially important for V1 voxels, since the stimuli presented in the early visual area localizer were larger than the stimuli presented in the main experiment.
Each run started with an 8 s fixation. Each block lasted 10 s, starting with a 1 s cue and a 1 s fixation. Ten exemplars from one or both categories were presented during the block, each for 400 ms, followed by 400 ms of fixation. There was an 8 s fixation between blocks and a final 8 s fixation after the last block. The presentation order of the blocks was random and counterbalanced across the experimental runs. Each run lasted 2 min 32 s. For the main experiment, 14 participants completed 16 runs, two participants completed 14 runs, and three participants completed 12 runs.
Localizer experiments
In this study we examined the primary visual cortex V1 along with the object-selective areas LO and pFs, the body-selective EBA, and the scene-selective PPA. All participants completed four localizer runs which were used to define the primary visual and the category-selective ROIs. We used previously established protocols for the localizer experiments, but the details are repeated here for clarification and convenience.
Early visual area localizer
We used meridian mapping to localize the primary visual cortex V1. Participants viewed a black-and-white checkerboard pattern with a diameter of 27.1° of visual angle through a 60 degree polar angle wedge aperture. The wedge was presented either horizontally or vertically. Participants were asked to detect a luminance change in the wedge in a blocked-design paradigm. Each run consisted of four horizontal and four vertical blocks, each lasting 16 s, with 16 s of fixation in between. A final 16 s fixation followed the last block. Each run lasted 272 s. The order of the blocks was counterbalanced within each run. Each participant completed two runs of this localizer.
Category localizer
A category localizer was used to localize the cortical regions selective to scenes, bodies, and objects. In a blocked-design paradigm, participants viewed stimuli from the five categories of faces, scenes, objects, bodies, and scrambled images, with each stimulus subtending 14.3° of visual angle. Each localizer run contained two 16 s blocks of each category, with the presentation order counterbalanced within each run. An 8 s fixation period was presented at the beginning, middle, and end of the run. In each block, 20 stimuli from the same category were presented. Each trial lasted 750 ms with 50 ms fixation in between. Participants were asked to maintain their fixation on a red circle at the center of the screen and press a key when they detected a slight jitter in the stimuli. Participants completed two runs of this localizer, each lasting 344 s. LO (Malach et al., 1995), pFs (Grill-Spector et al., 1998), EBA (Downing et al., 2001), and PPA (Epstein et al., 1999) were then defined using this category localizer.
Data acquisition
Data were collected on a Siemens Prisma MRI system using a 64-channel head coil at the National Brain-mapping Laboratory (NBML). For each participant, we performed a whole-brain anatomical scan using a T1-weighted MP-RAGE sequence at the beginning of data acquisition. For the functional scans, including the main experiment and the localizer experiments, 33 slices parallel to the AC-PC line were acquired using T2*-weighted gradient-echo echo-planar sequences covering the whole brain (TR = 2 s, TE = 30 ms, flip angle = 90°, voxel size = 3 × 3 × 3 mm³, matrix size = 64 × 64). The stimuli were back-projected onto a screen positioned at the rear of the magnet using an LCD projector with a refresh rate of 60 Hz and a spatial resolution of 768 × 1024, and participants viewed the screen through a mirror attached to the head coil. MATLAB and Psychtoolbox were used to create all stimuli.
Preprocessing
fMRI data analysis was performed using FreeSurfer (https://surfer.nmr.mgh.harvard.edu) and in-house MATLAB code. Preprocessing steps included 3D motion correction, slice timing correction, and linear and quadratic trend removal. The data were motion-corrected within each run and aligned to the anatomical data using the middle time point of that run. The localizer data were smoothed using a 5 mm FWHM Gaussian kernel, but no spatial smoothing was applied to the data from the main experiment, to optimize the voxel-wise analyses. A double gamma function was used to model the hemodynamic response function. We eliminated the first four volumes of each run to allow the signal to reach a steady state.
ROI definition
Using FreeSurfer’s tksurfer module, we delineated the primary visual cortex V1 using a contrast of horizontal versus vertical polar angle wedges to reveal the topographic maps in the occipital cortex (Sereno et al., 1995; Tootell et al., 1998). To define the object-selective areas LO in the lateral occipital cortex and pFs in the posterior fusiform gyrus (Malach et al., 1995; Grill-Spector et al., 1998), we used a contrast of objects versus scrambled images. Active voxels in the lateral occipital and ventral occipitotemporal cortex were selected as LO and pFs, respectively, following the procedure described by Kourtzi and Kanwisher, 2000. We used a contrast of scenes versus objects to define the scene-selective area PPA in the parahippocampal gyrus (Epstein et al., 1999), and a contrast of bodies versus objects to define the body-selective area EBA in the lateral occipitotemporal cortex (Downing et al., 2001). The activation maps for both localizers were thresholded at , uncorrected.
Data analysis
We performed a general linear model (GLM) analysis for each participant to estimate voxel-wise regression coefficients for each of the eight task conditions. The onset and duration of each block were convolved with a hemodynamic response function and entered into the GLM as regressors. Movement parameters and linear and quadratic nuisance regressors were also included in the GLM. We then used the obtained coefficients to compare the BOLD response across conditions in each ROI. To compensate for the difference in size between the localizer stimuli and the stimuli presented in the main experiment, we selected active voxels in each ROI using a contrast between the BH condition (with ignored body and house stimuli) and the fixation block (with no stimuli presented). We selected the voxels that were significantly active during the BH condition compared to the fixation block (with ) across all runs for all further analyses. Preferred and null categories for each voxel were determined using voxel responses in the conditions with isolated stimuli in which the participant performed the task on the fixation point (color blocks). We then determined the activity in seven conditions for each voxel: P^at, P^atN, PN^at, N^at, P, PN, and N, with P and N denoting the presence of the preferred and null stimuli and the superscript denoting the attended category. For instance, P refers to the condition in which the unattended preferred stimulus was presented in isolation, and P^atN refers to the paired condition with the attended preferred and unattended null stimuli.
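As an illustration, the preferred/null labeling described above can be sketched in a few lines (the analysis itself was done in MATLAB; the function below is ours and merely illustrates the rule):

```python
def assign_preferred(resp_body_isolated, resp_house_isolated):
    """Label a voxel's preferred (P) and null (N) categories from its
    responses to isolated, ignored stimuli (fixation-task blocks)."""
    if resp_body_isolated >= resp_house_isolated:
        return {'P': 'body', 'N': 'house'}
    return {'P': 'house', 'N': 'body'}
```

A voxel responding more strongly to the ignored isolated body stimulus than to the ignored isolated house stimulus is labeled body-preferring, and vice versa.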
Model details
We used three models to predict the results: the weighted sum, the weighted average, and the normalization models. The weighted sum model is a simple linear model suggesting that the response to multiple stimuli is the sum of the responses to the individual stimuli, and that attention to a stimulus scales the response to that stimulus by the attention-related parameter, β:
R_{P^at N} = β R_P + R_N and R_{P N^at} = R_P + β R_N,

with R_{PN} denoting the response elicited with the preferred and null stimuli present in the receptive field, and R_P and R_N denoting the responses to the isolated preferred and null stimuli, respectively. The superscript at specifies the attended stimulus; the stimulus is ignored otherwise.
The weighted average model (Zoccolan et al., 2005; Macevoy and Epstein, 2009; Baeck et al., 2013) is also a linear model, positing that the response to multiple stimuli is the average of the individual responses. Similar to the weighted sum model, the response to an attended stimulus is enhanced by the attention-related parameter, β:

R_{P^at N} = (β R_P + R_N) / 2 and R_{P N^at} = (R_P + β R_N) / 2
The normalization model of attention (Heeger, 1992; Carandini et al., 1997; Reynolds and Heeger, 2009; Ni et al., 2012) can be described using divisive normalization with a saturation term in the denominator:

R_{PN} = (c_P L_P + c_N L_N) / (c_P + c_N + σ)
Here, L_P and L_N denote the excitatory drives induced by the preferred and the null stimulus, respectively, and σ represents the semi-saturation constant. c_P and c_N are the respective contrasts of the stimuli, with zero contrast denoting that the respective stimulus is not present in the visual field. In our experiment, we set contrast values to one when a stimulus was presented and to zero when it was not. When attention is directed toward one of the stimuli, we can rewrite Equation 3 as:

R_{P^at N} = (β c_P L_P + c_N L_N) / (β c_P + c_N + σ)
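Under these definitions, the predictions of the three models for a single voxel can be sketched as follows (a minimal sketch; the function names and toy parameter values are ours, and the study's own fitting code was written in MATLAB):

```python
def weighted_sum(r_p, r_n, beta, attend=None):
    """Weighted sum prediction; attend is 'P', 'N', or None (both ignored)."""
    bp = beta if attend == 'P' else 1.0
    bn = beta if attend == 'N' else 1.0
    return bp * r_p + bn * r_n

def weighted_average(r_p, r_n, beta, attend=None):
    """Weighted average (equal weights) prediction."""
    return weighted_sum(r_p, r_n, beta, attend) / 2.0

def normalization(L_p, L_n, sigma, beta, c_p=1.0, c_n=1.0, attend=None):
    """Normalization model of attention: the attention parameter scales
    the attended stimulus drive in both numerator and denominator."""
    bp = beta if attend == 'P' else 1.0
    bn = beta if attend == 'N' else 1.0
    return (bp * c_p * L_p + bn * c_n * L_n) / (bp * c_p + bn * c_n + sigma)
```

With a strong preferred drive (e.g., L_p = 8, L_n = 1, σ = 1, β = 2), attending to the preferred stimulus pushes the normalization prediction toward a winner-take-all response, while attending to the null stimulus leaves it closer to the average, mirroring the asymmetry described in the Results.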
It is noteworthy that the normalization model is different in nature from the two linear models and takes into account the suppression caused by the neighboring pool even in the presence of a single stimulus. We cannot, therefore, use the measured response in isolated conditions as the excitatory drive in paired conditions. Rather, we need extra parameters to estimate the excitation caused by each stimulus. We then use this excitation to predict the response to attended and ignored stimuli in isolated and paired conditions (Ni et al., 2012; Ni and Maunsell, 2019). On the other hand, the weighted sum and the weighted average models are not concerned with the underlying excitation and suppression. The assumption of these models is based on the resulting response that we actually measure in the paired condition, considering it to be respectively the sum or the average of the measured response in the isolated conditions. In order to take the difference in the number of model parameters into account, we have used both cross-validated r-squared on independent data and AIC measures (see the section on model comparison).
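For models fit by minimizing squared error, AIC can be computed from the residual sum of squares. The sketch below uses the standard least-squares form of AIC; it illustrates the penalty for extra parameters rather than reproducing the study's exact implementation:

```python
import numpy as np

def aic_least_squares(responses, predictions, k):
    """AIC for a least-squares fit: n * ln(SSE / n) + 2k, where n is the
    number of data points and k is the number of free parameters."""
    resid = np.asarray(responses, dtype=float) - np.asarray(predictions, dtype=float)
    n = resid.size
    sse = np.sum(resid ** 2)
    return n * np.log(sse / n) + 2 * k
```

The model with the lower AIC is preferred; the 2k term penalizes the normalization model (four parameters) relative to the weighted sum and weighted average models (one parameter each), so a lower AIC for the normalization model cannot be attributed to its extra parameters.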
We fit model parameters for the three models. β was fit as a free parameter for all models. The normalization model had three additional free parameters, L_P, L_N, and σ. σ and β were constrained to be greater than zero and one, respectively, and less than 10. L_P and L_N were constrained to have an absolute value of less than 10. We estimated model parameters using constrained nonlinear optimization, minimizing the sum of squared errors. Values of the estimated parameters of the weighted sum, weighted average, and normalization models are provided in Supplementary files 3–5, respectively, for the odd and even runs.
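This kind of constrained fit can be sketched with SciPy's bounded optimizer (an illustrative sketch with toy data; the study's fitting was implemented in MATLAB, and all names below are ours):

```python
import numpy as np
from scipy.optimize import minimize

def norm_pred(params, c_p, c_n, att_p, att_n):
    """Normalization-model prediction for one condition.
    att_p / att_n are 1 if that stimulus is attended, else 0."""
    L_p, L_n, sigma, beta = params
    bp = beta if att_p else 1.0
    bn = beta if att_n else 1.0
    return (bp * c_p * L_p + bn * c_n * L_n) / (bp * c_p + bn * c_n + sigma)

def sse(params, conditions, responses):
    """Sum of squared errors across conditions."""
    pred = np.array([norm_pred(params, *c) for c in conditions])
    return np.sum((pred - responses) ** 2)

# The seven conditions as (c_P, c_N, attend_P, attend_N)
conditions = [(1, 0, 1, 0), (1, 1, 1, 0), (0, 1, 0, 1), (1, 1, 0, 1),
              (1, 0, 0, 0), (1, 1, 0, 0), (0, 1, 0, 0)]
responses = np.array([3.0, 3.2, 1.0, 2.0, 2.5, 1.8, 0.8])  # toy voxel data

# Bounds as described in the text: |L_P|, |L_N| < 10, 0 < sigma < 10, 1 < beta < 10
bounds = [(-10, 10), (-10, 10), (1e-6, 10), (1, 10)]
x0 = np.array([1.0, 1.0, 1.0, 1.5])
fit = minimize(sse, x0, args=(conditions, responses), bounds=bounds)
```

With bounds supplied, `minimize` defaults to a bounded quasi-Newton method and returns parameter estimates that respect the stated constraints.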
Weighted average model variants
In addition to the three main models, we tested three variants of the weighted average model. The main weighted average model had equal weights for the two stimuli (weighted average EW). For the first variant, we examined a weighted average model with unequal weights for the two stimuli (weighted average UW). According to this model, the response to two simultaneously presented stimuli was a weighted average of the responses to the isolated stimuli, but in contrast to the main weighted average model, the weights in this average were not equal. Instead, each stimulus had a different weight, with the sum of the weights set to 1:

R_{P^at N} = β w R_P + (1 − w) R_N
Here, denotes the response elicited with the preferred and null stimuli present in the receptive field, and and denote the response to isolated preferred and null stimuli, respectively. The superscript specifies the attended stimulus, and the stimulus is ignored otherwise. denotes the weight of the preferred stimulus, and is the attention-related parameter.
The second variant we tested was a weighted average variant with unequal weights and unequal betas (weighted average UWUB). Based on this model, the preferred and null stimuli had different weights and different attention-related parameters:
R_{P^att N} = β_P w_P R_P + w_N R_N
R_{P N^att} = w_P R_P + β_N w_N R_N

Here, w_P and w_N respectively denote the weights of the preferred and null stimuli, and β_P and β_N denote the attention-related parameters of the preferred and null stimuli, respectively. In this variant, the sum of the weights was not constrained to equal 1, as it was for the weighted average EW and weighted average UW models. This model is therefore a generalization of both the weighted sum and the weighted average models.
Lastly, we tested a nonlinear variant with unequal weights, unequal betas, and an extra saturation parameter (weighted average UWUB saturation). This variant is identical in form to the weighted average UWUB model, but it additionally estimates a saturation value, s, for each voxel. After parameter estimation, the minimum of the calculated response and the estimated saturation parameter was chosen as the response:

R = min(R_UWUB, s)
Values of the estimated parameters of the weighted average UW, UWUB, and UWUB saturation model variants are provided in Supplementary files 6–8, respectively, for the odd and even runs.
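The three variants can be written compactly. The sketch below follows the verbal descriptions above; the convention that the attention parameter scales only the attended stimulus's term, and all argument names, are assumptions for illustration.

```python
def wavg_uw(R_p, R_n, w, beta, attend="P"):
    """Unequal weights summing to 1 (weighted average UW)."""
    if attend == "P":
        return beta * w * R_p + (1.0 - w) * R_n
    return w * R_p + beta * (1.0 - w) * R_n

def wavg_uwub(R_p, R_n, w_p, w_n, beta_p, beta_n, attend="P"):
    """Unequal, unconstrained weights and separate attention parameters (UWUB)."""
    if attend == "P":
        return beta_p * w_p * R_p + w_n * R_n
    return w_p * R_p + beta_n * w_n * R_n

def wavg_uwub_sat(R_p, R_n, w_p, w_n, beta_p, beta_n, s, attend="P"):
    """UWUB with a per-voxel saturation cap s on the calculated response."""
    return min(wavg_uwub(R_p, R_n, w_p, w_n, beta_p, beta_n, attend), s)
```

Note that with both weights equal to 1, `wavg_uwub` reduces to a weighted sum, and with weights summing to 1 it reduces to the weighted average family, which is why the text calls the UWUB variant a generalization of both.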
Model-data comparison
We split the fMRI data into two halves of odd and even runs and estimated model parameters for the first half as described. Then, using the estimated parameters for the first half, we calculated model predictions for each voxel in each condition and compared the predictions with the left-out half of the data. All comparisons of data with models, including the calculation of the goodness of fit, were done using the left-out data. We repeated this procedure twice: once using the odd half of the data for parameter estimation and the even half for comparing model predictions with the data, and a second time using the even half of the data for parameter estimation and the odd half for comparison with the model predictions. All figures, including model results, illustrate the average of the two repetitions. Since the weighted sum and the weighted average models used the response in the P and N conditions to predict responses in the remaining five conditions, we only used these five conditions and excluded the P and N conditions when calculating the goodness of fit for all models. The goodness of fit was calculated by taking the square of the correlation coefficient between the observed and predicted responses for each voxel across the five modeled conditions (Figure 3—figure supplement 1). We also calculated the correlation between voxel responses of the two halves of the data across the same five conditions and calculated the noise ceiling in each ROI as the square of this correlation coefficient. We determined the r-squared difference from the noise ceiling (NRD) in each ROI by calculating the difference between the noise ceiling and the model’s goodness of fit in that ROI:

NRD = r²_noise ceiling − r²_model
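The voxel-wise goodness-of-fit, noise-ceiling, and NRD computations above can be sketched as follows; array contents and function names are illustrative.

```python
import numpy as np

def voxel_r2(observed, predicted):
    """Squared Pearson correlation across the five modeled conditions."""
    r = np.corrcoef(observed, predicted)[0, 1]
    return r ** 2

def noise_ceiling(half_a, half_b):
    """Squared correlation between the two data halves, same conditions."""
    return voxel_r2(half_a, half_b)

def nrd(half_a, half_b, predicted):
    """r-squared difference from the noise ceiling (NRD).

    The model is fit on half_a, so its predictions are scored against the
    left-out half_b, consistent with the split-half procedure in the text.
    """
    return noise_ceiling(half_a, half_b) - voxel_r2(half_b, predicted)
```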
We compared the goodness of fit of the three models across all ROIs using a repeated measures ANOVA (see the Statistics section). We also compared the NRD of the normalization model across all ROIs using a one-way ANOVA. To account for the difference in the number of free parameters across models, we used the Akaike Information Criterion (AIC) (Burnham and Anderson, 2004; Denison et al., 2021). Under the assumption of normally distributed errors, the AIC is calculated as:

AIC = n · ln(RSS/n) + 2k + C

where n denotes the number of observations, RSS denotes the residual sum of squares, k is the number of free parameters of the model, and C is a constant that is the same for all models. A smaller AIC value indicates a better fit to the data. We therefore calculated ΔAIC for all model pairs.
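The AIC comparison reduces to a few lines; since C is shared across models, it is dropped here, so only ΔAIC values are meaningful.

```python
import numpy as np

def aic(observed, predicted, k):
    """AIC under Gaussian errors: n*ln(RSS/n) + 2k, up to the shared constant C."""
    resid = np.asarray(observed) - np.asarray(predicted)
    n = resid.size
    rss = float(np.sum(resid ** 2))
    return n * np.log(rss / n) + 2 * k

def delta_aic(observed, pred_a, k_a, pred_b, k_b):
    """AIC(model A) - AIC(model B): positive values favor model B."""
    return aic(observed, pred_a, k_a) - aic(observed, pred_b, k_b)
```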
Simulation
To further check whether the success of the normalization model was due to its higher number of parameters or to its being a closer approximation of the underlying neural computations, we used a simulation approach. We simulated neural responses to single and multiple stimuli in the absence and presence of attention. In a neural population composed of 10^4 neurons, each neuron was either body- or house-selective. Each neuron also responded to the category other than its preferred category, but to a lesser degree and with variation. We simulated three kinds of neurons: (i) summing neurons, whose responses to multiple and attended stimuli followed the weighted sum model, (ii) averaging neurons, whose responses followed the weighted average model, and (iii) normalizing neurons, whose responses followed divisive normalization. We chose neural responses and the attention factor randomly from ranges comparable with neural studies of attention and object recognition in the ventral visual cortex (Ni et al., 2012; Bao and Tsao, 2018). Using the equations discussed for each of the models, we calculated the response of each neuron to the seven conditions of our main fMRI experiment. We then randomly chose 200 neurons from the population, with a ratio of body/house preference similar to that of each ROI in the main experiment, and averaged the selected neurons’ responses to form a voxel. We added independent Gaussian noise to this voxel response 16 times to generate 16 measurements (corresponding to the 16 runs of the fMRI experiment) for each condition. We simulated 30 voxels for each ROI. Then, dividing the runs into two halves, we performed the same modeling process as in the fMRI experiment: for the weighted sum, weighted average, and normalization models, we estimated model parameters on one half of the data and predicted voxel responses for the other half.
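A compressed sketch of the normalizing-neuron branch of this simulation is given below. The drive ranges, noise level, and normalization form are illustrative stand-ins for the ranges described in the text, not the authors' actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_voxel_runs(n_neurons=200, n_runs=16, noise_sd=0.05):
    """One simulated voxel: average of normalizing neurons plus run-wise noise."""
    # Excitatory drives: strong for the preferred category, weaker (with
    # variation) for the other category; per-neuron attention factor.
    L_pref = rng.uniform(1.0, 5.0, n_neurons)
    L_null = L_pref * rng.uniform(0.1, 0.5, n_neurons)
    beta = rng.uniform(1.5, 4.0, n_neurons)
    sigma = 1.0

    def resp(L_att, L_other, b):
        drive = b * L_att + L_other
        return drive / (drive + sigma)  # divisive normalization

    conds = np.stack([
        resp(L_pref, 0.0, beta),     # isolated preferred, attended
        resp(L_null, 0.0, beta),     # isolated null, attended
        resp(L_pref, 0.0, 1.0),      # isolated preferred, ignored
        resp(L_null, 0.0, 1.0),      # isolated null, ignored
        resp(L_pref, L_null, beta),  # paired, attend preferred
        resp(L_null, L_pref, beta),  # paired, attend null
        resp(L_pref, L_null, 1.0),   # paired, ignored
    ])
    voxel = conds.mean(axis=1)  # average neurons into one voxel response
    # 16 independent noisy measurements, as in the 16 fMRI runs
    return voxel + rng.normal(0.0, noise_sd, size=(n_runs, conds.shape[0]))
```

Repeating this for 30 voxels per simulated ROI, splitting the 16 runs into halves, and fitting the three models would reproduce the pipeline described above.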
Quantifying attentional effects
We defined two indices to quantify the observed effects of attention. The first index compared voxel activities in the paired conditions in which attention was directed toward one of the objects: we defined the response change index as the difference in average voxel activity when attention shifted from the preferred to the null stimulus. The second index quantified the asymmetry in attentional modulation: the asymmetry index compared the effect of the unattended stimulus on the response between conditions with an unattended preferred stimulus and conditions with an unattended null stimulus. The comparison of the observed indices with the indices calculated from model predictions was done using the left-out part of the data.
Statistics
We performed sets of repeated measures ANOVAs to test for the main effects of model and ROI, and their interaction, on model goodness of fit and on the reported effects of attention. For all ANOVAs, we used Mauchly’s test to check whether the assumption of sphericity had been met. Where the assumption of sphericity was violated, we used the Greenhouse-Geisser estimate to correct the degrees of freedom. Where applicable, we corrected for multiple comparisons using the Dunn-Sidak procedure.
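For reference, the Greenhouse-Geisser epsilon used to correct the degrees of freedom can be computed from the sample covariance of the within-subject conditions. This is a generic textbook sketch, not the authors' code.

```python
import numpy as np

def greenhouse_geisser_epsilon(data):
    """GG epsilon for one within-subject factor; data is (subjects, levels)."""
    k = data.shape[1]
    S = np.cov(data, rowvar=False)
    # Double-center the covariance matrix
    Sc = S - S.mean(axis=0, keepdims=True) - S.mean(axis=1, keepdims=True) + S.mean()
    # Epsilon = (sum of eigenvalues)^2 / ((k-1) * sum of squared eigenvalues);
    # for a symmetric matrix the eigenvalue sums equal tr(Sc) and sum(Sc**2).
    return np.trace(Sc) ** 2 / ((k - 1) * np.sum(Sc ** 2))

def corrected_dof(epsilon, k, n_subjects):
    """Greenhouse-Geisser-corrected numerator and denominator df."""
    return epsilon * (k - 1), epsilon * (k - 1) * (n_subjects - 1)
```

Epsilon is bounded between 1/(k − 1) and 1, with 1 indicating perfect sphericity (no correction needed).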
To compare the goodness of fit of the three models across ROIs, we performed a repeated measures ANOVA with within-subject factors of model (weighted sum, weighted average, and normalization) and ROI (V1, LO, pFs, EBA, and PPA) to test for the main effects of model and ROI and their interaction. Mauchly’s test indicated that the assumption of sphericity had been violated, so we corrected the degrees of freedom using the Greenhouse-Geisser estimate. We also ran a one-way ANOVA to test for the effect of ROI on the difference between the noise ceiling and the normalization model’s r-squared (Equation 7). Mauchly’s test again indicated a violation of sphericity, and the degrees of freedom were corrected using the Greenhouse-Geisser estimate.
We then calculated the difference between the observed effects of attention in the data and the predictions of each model. To compare model predictions of the two attentional effects across ROIs, we ran two sets of 3 × 5 repeated measures ANOVAs with within-subject factors of model and ROI. Mauchly’s test indicated that the assumption of sphericity was met for both tests.
To compare the weighted average UW and UWUB model variants with the weighted average EW model and the normalization model, we ran a 4 × 5 repeated measures ANOVA with within-subject factors of model and ROI. Mauchly’s test indicated that the assumption of sphericity had been violated, so we used the Greenhouse-Geisser estimate to correct the degrees of freedom. Finally, to compare the fits of the normalization model and the weighted average UWUB saturation model across ROIs, we ran a 2 × 5 repeated measures ANOVA. Mauchly’s test again showed a violation of sphericity, so the Greenhouse-Geisser estimate was used to correct the degrees of freedom.
Data availability
fMRI data have been deposited in OSF under DOI https://doi.org/10.17605/OSF.IO/8CH9Q.
References
- Mechanisms of top-down attention. Trends in Neurosciences 34:210–224. https://doi.org/10.1016/j.tins.2011.02.003
- Representation of multiple objects in macaque category-selective areas. Nature Communications 9:1774. https://doi.org/10.1038/s41467-018-04126-7
- Normalization governs attentional modulation within human visual cortex. Nature Communications 10:5660. https://doi.org/10.1038/s41467-019-13597-1
- Linearity and normalization in simple cells of the macaque primary visual cortex. The Journal of Neuroscience 17:8621–8644. https://doi.org/10.1523/JNEUROSCI.17-21-08621.1997
- Normalization as a canonical neural computation. Nature Reviews Neuroscience 13:51–62. https://doi.org/10.1038/nrn3136
- A taxonomy of external and internal attention. Annual Review of Psychology 62:73–101. https://doi.org/10.1146/annurev.psych.093008.100427
- A dynamic normalization model of temporal attention. Nature Human Behaviour 5:1674–1685. https://doi.org/10.1038/s41562-021-01129-1
- Neural mechanisms of selective visual attention. Annual Review of Neuroscience 18:193–222. https://doi.org/10.1146/annurev.ne.18.030195.001205
- Normalizing population receptive fields. PNAS 118:e2118367118. https://doi.org/10.1073/pnas.2118367118
- Effect of luminance contrast on BOLD fMRI response in human primary visual areas. Journal of Neurophysiology 79:2204–2207. https://doi.org/10.1152/jn.1998.79.4.2204
- Normalization of cell responses in cat striate cortex. Visual Neuroscience 9:181–197. https://doi.org/10.1017/s0952523800009640
- What does fMRI tell us about neuronal activity? Nature Reviews Neuroscience 3:142–151. https://doi.org/10.1038/nrn730
- When size matters: attention affects performance by contrast or response gain. Nature Neuroscience 13:1554–1559. https://doi.org/10.1038/nn.2669
- Contrast dependence of response normalization in area MT of the rhesus macaque. Journal of Neurophysiology 88:3398–3408. https://doi.org/10.1152/jn.00255.2002
- Changing the spatial scope of attention alters patterns of neural gain in human cortex. The Journal of Neuroscience 34:112–123. https://doi.org/10.1523/JNEUROSCI.3943-13.2014
- Compressive spatial summation in human visual cortex. Journal of Neurophysiology 110:481–494. https://doi.org/10.1152/jn.00105.2013
- A two-stage cascade model of BOLD responses in human visual cortex. PLOS Computational Biology 9:e1003079. https://doi.org/10.1371/journal.pcbi.1003079
- The functional organization of high-level visual cortex determines the representation of complex visual stimuli. The Journal of Neuroscience 40:7545–7558. https://doi.org/10.1523/JNEUROSCI.0446-20.2020
- Cortical regions involved in perceiving object shape. The Journal of Neuroscience 20:3310–3318. https://doi.org/10.1523/JNEUROSCI.20-09-03310.2000
- Attentional modulation of MT neurons with single or multiple stimuli in their receptive fields. The Journal of Neuroscience 30:3058–3066. https://doi.org/10.1523/JNEUROSCI.3766-09.2010
- Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. Journal of Neurophysiology 77:24–42. https://doi.org/10.1152/jn.1997.77.1.24
- Neuronal mechanisms of visual attention. Annual Review of Vision Science 1:373–391. https://doi.org/10.1146/annurev-vision-082114-035431
- Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. The Journal of Neuroscience 19:431–441. https://doi.org/10.1523/JNEUROSCI.19-01-00431.1999
- Neural mechanisms of selective visual attention. Annual Review of Psychology 68:47–72. https://doi.org/10.1146/annurev-psych-122414-033400
- Spatially tuned normalization explains attention modulation variance within neurons. Journal of Neurophysiology 118:1903–1913. https://doi.org/10.1152/jn.00218.2017
- Neuronal effects of spatial and feature attention differ due to normalization. The Journal of Neuroscience 39:5493–5505. https://doi.org/10.1523/JNEUROSCI.2106-18.2019
- Top-down control of visual attention. Current Opinion in Neurobiology 20:183–190. https://doi.org/10.1016/j.conb.2010.02.003
- Contrast’s effect on spatial summation by macaque V1 neurons. Nature Neuroscience 2:733–739. https://doi.org/10.1038/11197
- Goal-directed visual processing differentially impacts human ventral and dorsal visual representations. The Journal of Neuroscience 37:8767–8782. https://doi.org/10.1523/JNEUROSCI.3392-16.2017
- Dynamic shifts of visual receptive fields in cortical area MT by spatial attention. Nature Neuroscience 9:1156–1160. https://doi.org/10.1038/nn1748
- A normalization framework for emotional attention. PLOS Biology 14:e1002578. https://doi.org/10.1371/journal.pbio.1002578
- Multiple object response normalization in monkey inferotemporal cortex. The Journal of Neuroscience 25:8150–8164. https://doi.org/10.1523/JNEUROSCI.2058-05.2005
Decision letter
- Marisa Carrasco, Reviewing Editor; New York University, United States
- Tirin Moore, Senior Editor; Howard Hughes Medical Institute, Stanford University, United States
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "Evidence for Normalization as a Fundamental Operation Across the Human Visual Cortex" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Tirin Moore as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission. Please note that the reviewers think that the paper has potential but extensive revisions are needed.
Essential Revisions:
Theoretical framework and interpretation
(1) The title is misleading and a bit grandiose – "In" instead of "across". Across would imply evidence for normalization across all ~30 areas of visual brain, whereas the authors just have 5. Suggestion to replace with 'in Human visual cortex' or 'in object-selective and category-selective regions in human visual cortex' (see related next point).
(2) Given that the normalization model is a favorite in the attended conditions but less so in the other conditions, the abstract and title could be toned down, as this gives the impression that the normalization model consistently outperforms the other models.
(3) Lines 18-20 – Include more recent references
(4) Line 21 – Explain how normalization regulates the gain of the attended stimulus in the monkey brain.
(5) The authors state at multiple points in the manuscript that 'no study to date has directly tested the validity of the normalization model in predicting human cortical responses". Which is odd because the authors then go to cite a few of the studies that have done exactly that. Moreover, there are a number of other studies left out of, such as Ithipuripat et al., 2014. Perhaps the authors mean 'using fMRI', but in that case there still remains a few (including that cited) that suggest otherwise. The authors should soften this statement, or clarify on what they mean.
(6) Lines 23-25 – Related to the previous point. There is a recent study in PNAS which must be cited Aqil, Knapen and Dumoulin (PNAS 2021). Moreover, the authors should discuss how this manuscript goes beyond that study and discuss any supporting or conflicting conclusions. See also the Commentary by Foster and Ling (PNAS 2021).
Also, there is evidence of normalization in behavioral studies that seems worth mentioning; e.g., Herrmann et al. (Nature Neurosci 2010, JoV 2012); Schwedhelm et al. (PLoS Comput Biol 2016).
(7) The authors should discuss if there is neural and/or behavioral evidence of normalization in object-based attention (which would also provide some justification for the stimuli chosen, rather than what prior work of a similar vein has used i.e. contrast gratings). This is important as it is the manipulation in the present study, but most of the papers in the Introduction refer to spatial attention, and it should not be taken for granted that either the behavioral or the neural effects are the same for different types of attention or for different types of stimuli.
(8) Line 33: Specify which models? The normalization and weighted average model?
(9) Line 55-56. The authors refer to the figure to show how BOLD response changes across task conditions, but there is no interpretation of these data. How do we expect the raw data to look? Does this match the current data? If so why, if not, why not? In its current form, a figure of data is shown without explaining what it means. The authors could present examples of the BOLD signal elicited from the seven conditions; doing so would provide a deeper look at the data and how the fMRI signal changes with condition. Is there a pattern to the response?
(10) Line 86-89: The authors state, "As evident in the figure, the predictions of the normalization model (in orange) very closely follow the observed data (in navy), while the predictions of the weighted sum and weighted average models (light blue and gray, respectively) have significant deviations from the data". In which ROIs? All of them? It looks like the normalization model looks close to the data for PFs, EBA and LO. For V1 and PPA it is better than the other two but it is not as close. Is this what is shown in 2C?
Suggestion to order the panels in Figure 2 logically (the BOLD and model data all presented in A for all 5 ROIs) – there's no reason to just show LO1 as the main figure because it looks best, then the variance explained panel after – then separately the R2. Much of the paper comes across as 'normalization is better for all' and removes any nuance in differences in different areas; why might it be better in some than others? This is not detailed in the manuscript.
(11) Could the asymmetry effect found in the BOLD response be driven potentially by nonlinearities in the BOLD response, rather than being neural in origin? Do the authors have ways to mitigate this concern? One thought would be to show the relation (or lack thereof) between voxels that have a high BOLD response, and their degree of asymmetry. If the effect is not due to saturation, one would expect no significant relation.
(12) The changes in response with attention and the asymmetry effects are predictions of the biased competition model of attention and reference should be made to this literature (e.g., papers by Desimone and Duncan). In addition, this is not the first time that an asymmetry effect is observed in the fMRI literature (line 131). This effect is a direct prediction of the biased competition theory of attention, and has been previously reported in the fMRI literature as well (e.g., in reference 12).
(13) Line 178-181 – The limitation mentioned here is very vague. Some discussion of the limitations should go beyond 'we are aware there are limitations and hope one day they can be overcome'. What do these limitations mean for the data analysis and the interpretation of results? What could alternative explanations be? How will they be overcome?
Model
(1) A main concern regarding the interpretation of these results has to do with the sparseness of data available to fit with the models. The authors pit two linear models against a nonlinear (normalization) model. The predictions for weighted average and summed models are both linear models doomed to poorly match the fMRI data, particularly in contrast to the nonlinear model. So, while the verification that responses to multiple stimuli don't add up or average each other is appreciated, the model comparisons seem less interesting in this light. The model testing endeavor seems rather unconstrained. A 'true' test of the model would likely need a whole range of contrasts tested for one (or both) of the stimuli. Otherwise, as it stands we simply have a parameter (σ) that instantly gives more wiggle room than the other models. It would be fairer to pit this normalization model against other nonlinear models. Indeed, this has already been done in previous work by Kendrick Kay, Jon Winawer and Serge Dumoulin's groups. The same issue of course extends to the attended conditions.
(2) The difference in the number of free parameters biases the results. The normalization model has more free parameters than the other two models. The authors use a split-half approach to avoid the problem of overfitting, but a model with more free parameters in and of itself potentially has a better capacity to fit the data compared to the other models. This built in bias could account for the results. Model comparison (like with the AIC measure) is necessary to account for this difference.
(3) Related to the above point, the response to the Pref (P) and Null (N) stimuli are also free parameters in the normalization model and it's not clear why the values of Rp and Rn are not used here as for the other two models? These differences could again account for the better fit observed for the normalization model. Further, what are the values of Cp and Cn in the normalization model? How are they determined?
(4) The weighted average model is confusing. Equation 5a is a strict average model. The weighted average model (Desimone and Duncan) however proposes that the response to combined stimuli is a weighted average of the response to the individual stimuli (i.e., with possibly different weights for each stimulus). Second, attention biases these weight values by possibly different amounts for different stimuli/in different brain regions. In this study, the response to the combined stimuli is modeled as a strict average (with a fixed weight of 0.5), and further, one fixed weight (β) is assigned to P and N for attention effects. However, in the weighted average model, the effect of attention could potentially be different for the preferred and null stimuli (i.e., a different β for P and another β for N), which might generate a better fit for the weighted average model. Indeed, although it is difficult to read off the graph, attention appears to enhance the response to the N stimuli more than to the P stimuli. How might the results be affected if the weights in the PN condition are not fixed to 0.5, and if attention is allowed to differently affect the two types of stimuli? A similar argument could also be made for the weighted sum model.
(5) Line 57-61: What is the justification of using the weighted sum model when we know that this model is not reflective of what is going on in the brain? It sounds like it has been set up to fail. Has the weighted sum model been shown to predict responses in prior work? It is never really brought up again in the discussion, so its purpose is unclear. Wouldn't a comparison of the weighted average and normalization model make more sense? Better yet, consider comparisons with other model(s) beyond normalization that could explain the data.
Methods and data analysis
(1) FIGURE 1 – protocol – Is task difficulty equated across house category, body category, and color change at fixation?
(2) Did participants complete any practice trials for the one back task, or just went right into the scanner? Was the accuracy of the participants in the one back task recorded to ensure that they were paying attention and completing the task properly? Same for the fixation task
(3) Most of page 6 is a repetition of the methods (much copy and paste), not really needed, instead a short summary would be better or just move the methods to after the introduction.
(4) Some of the details regarding the actual design of the fMRI experiment seem glossed over. From what I can gather, each block was only a few seconds, with no rest between blocks? If so, the BOLD response would be exceptionally driven to saturating levels throughout the experiment. That said, it is difficult to unpack the details based on the Methods section as vague detail is provided.
(5) Lines 47-51. It is unclear how preferred (P) or null (N) stimulus categories were assigned to each voxel. Does a voxel preferring houses respond greater to houses than a body? Was there a statistical measure for determining what was the P or N stimulus for a given voxel? We imagine that almost all voxels in EBA significantly preferred bodies, and almost all voxels in PPA preferred houses. If so, how is this accounted for? Was there a reliable, statistically significant difference in house or body preference for voxels in the other ROIs, or is the difference between response activation to the two categories marginal? How many voxels preferred houses vs. bodies in each ROI? For each voxel, how reliable is the attribution of the P or N stimulus (e.g., did you use a cross-validation approach and determine P/N on some runs and tested the reliability on the remaining runs)?
(6) Were the P and N categories determined on all the data (a risk of double-dipping) or only on one half of the dataset?
(7) Provide more details of the GLM. What were the regressors?
(8) Lines 79-81: Were the model parameters determined in the second half and predictions made in the first half as well? Would averaging the two directions provide more robust results?
(9) Category stimuli were fit within a 10.2 degree square of visual angle. The methods describe the V1 localizer as a wedge 60 degree in angle. What was the stimulus extent (i.e. eccentricity) of the wedge? Was it also 10.2 degrees so that the V1 ROI encompassed the same amount of visual space as the category stimuli? Same question for the Category localizer. Were these stimuli also 10.2 degrees? Were the ROIs (at least for V1?) defined within the same eccentricity range so that the ROI analysis was confined to where the category stimuli were presented in the visual field?
(10) What software was used to define LOC PF EBA and PPA? And what metric do we have to ensure that these are reliable ROIs? The authors would benefit by including figures that show examples of the ROIs overlaid on contrast data for the localizer conditions. The same goes for the β weights within the ROIs.
(11) Preprocessing – how were the data aligned?
(12) The paper refers to a Supplementary Methods, which I could not find.
Results and statistics
(1) What was accuracy like on these tasks? There was no description of performance at all. It is necessary to provide this information to assess the attentional manipulation
(2) Line 96 – What is ts? T-statistic? Did you run a series of t-tests comparing normalization vs noise ceiling? Why not an ANOVA and correct for multiple comparisons? And what does it mean that the R2 was not significant different from noise ceiling? Interpretation of these results must be provided.
(3) Lines 119-123. Again, why a t-test statistic here? Did you run multiple t-tests for each ROI? Shouldn't this be an ANOVA across the 5 ROIs? As it stands, line 120 says that you ran multiple t tests across all the regions, and then report single t-statistic? The paper needs to be clear (and correct) in the type of statistics it uses to make its claims.
(4) There is no assessment of the quality of the data – no Β weight maps showing the responses on the brain, or what proportion of voxels in each ROI actually responded to the stimuli. A lot of interesting data that could be shown is reduced to line plots -this comes across as the paper starting at the end, rather than providing any sense of buildup in the data analysis and the conclusions that the authors draw.
(5) The authors show the goodness of fit across the 5 modeled conditions in 2C – fine, but what is the justification for lumping all the conditions together? Is there a supplement figure that teases this apart, with the goodness of fit for each model for each of the conditions, for each ROI? It would help to be comprehensive.
(6) Averaging all the data together (e.g., in Figure 2c across all conditions) does not provide a clear picture of the results. Is it possible that it is only in the attention conditions that the normalization model outperforms the others (although again, note the remark about the different numbers of free parameters in each model)? It would be more transparent to plot Figure 2c separately for the isolated/paired conditions, even though the story might become more complex.
(7) Related to the above points, the data are provided in a very synthesized manner, and raw results are not shown. For example, could we see the values of the estimated model parameters? How do they compare when determined in each half of the data (to get a sense of reliability)?
Figures
(1) Figure 1b: The gray squares overlap a lot so that we don't see the stimuli in each gray square.
(2) The y-axis label in Figure 2A is not BOLD response (this would be appropriate if the figure was a BOLD time series); is this the average fMRI Β weight?
(3) In Figure 2 (a and b), the connecting of data points as if they are on a continuum is not correct given their categorical nature. Moreover, the figure could be improved so that the actual data points are made more prominent, and the model predictions are dashed lines. What exactly do the error bars in Figure 2a and 2b correspond to?
(4) Suggestion to order the panels in Figure 2 logically (the BOLD and model data all presented in A for all 5 ROIs).
(5) Is there any statistic to help evaluate the data in Figure 2C? (Line 93)
(6) Figure 3 – Why not include the BOLD response for P only and N only? That would be informative.
https://doi.org/10.7554/eLife.75726.sa1

Author response
Essential Revisions:
Theoretical framework and interpretation
(1) The title is misleading and a bit grandiose – "In" instead of "across". Across would imply evidence for normalization across all ~30 areas of visual brain, whereas the authors just have 5. Suggestion to replace with 'in Human visual cortex' or 'in object-selective and category-selective regions in human visual cortex' (see related next point).
We changed the title from “Evidence for Normalization as a Fundamental Operation Across the Human Visual Cortex” to “Evidence for Normalization as a Fundamental Operation in the Human Visual Cortex”.
(2) Given that the normalization model is a favorite in the attended conditions but less so in the other conditions, the abstract and title could be toned down, as this gives the impression that the normalization model consistently outperforms the other models.
Based on comment 9 from the reviewers in the Methods and Data Analysis section of this letter, we changed our analyses to only include active voxels in each ROI to account for the differences in the stimulus size between the main task and the localizer tasks. This extra step improved our results so that the normalization model is now better than the weighted average model for both the attended and unattended conditions. As such, we decided to keep the title. Nevertheless, to more accurately convey the results, we have slightly modified the sentence in the abstract:
“Focusing on the primary visual area V1, the object-selective regions LO and pFs, the body-selective region EBA, and the scene-selective region PPA, we first modeled single-voxel responses using a weighted sum, a weighted average, and a normalization model and demonstrated that although the weighted sum and the weighted average models also made acceptable predictions in some conditions, the response to multiple stimuli could generally be better described by a model that takes normalization into account.”
(3) Lines 18-20 – Include more recent references
We have now included more recent references in the Introduction section in lines 50-54:
“In addition to regional computations for multiple-stimulus representation, the visual cortex relies on top-down mechanisms such as attention to select the most relevant stimulus for detailed processing (Moran and Desimone 1985, Desimone and Duncan 1995, Chun et al. 2011, Baluch and Itti 2011, Noudoost et al. 2010, Maunsell 2015, Thiele and Bellgrove 2018, Itthipuripat et al. 2014, Moore et al. 2017, Buschman and Kastner 2015).”
(4) Line 21 – Explain how normalization regulates the gain of the attended stimulus in the monkey brain.
We have replaced the sentence with a short description in the Introduction section in lines 56-60:
“Previous studies have demonstrated how the normalization computation accounts for these observed effects of attention in the monkey brain. They have suggested that normalization attenuates the neural response in proportion to the activity of the neighboring neuronal pool (Reynolds and Heeger 2009, Lee and Maunsell 2009, Boynton 2009, Ni et al. 2012).”
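For readers less familiar with this formulation, the attention-normalization interaction can be sketched in the standard Reynolds and Heeger (2009) form (our notation, not an equation from the manuscript; E is the excitatory stimulus drive, A the attention field, and σ the semi-saturation constant):

```latex
R(x,\theta) \;=\; \frac{A(x,\theta)\,E(x,\theta)}{\sigma + \sum_{x',\theta'} A(x',\theta')\,E(x',\theta')}
```

Attention scales both the numerator and the suppressive pool in the denominator, which is what lets normalization attenuate responses in proportion to the activity of the neighboring neuronal pool.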
(5) The authors state at multiple points in the manuscript that 'no study to date has directly tested the validity of the normalization model in predicting human cortical responses". Which is odd because the authors then go to cite a few of the studies that have done exactly that. Moreover, there are a number of other studies left out of, such as Ithipuripat et al., 2014. Perhaps the authors mean 'using fMRI', but in that case there still remains a few (including that cited) that suggest otherwise. The authors should soften this statement, or clarify on what they mean.
We thank the reviewer for pointing out the apparent paradox in our phrasing. We have modified the introduction and hope that the following paragraph can better convey the contribution of our study (added in the Introduction section in lines 73-82):
“Although previous studies have qualitatively suggested the role of normalization in the human visual cortex (Kliger and Yovel 2020, Itthipuripat et al. 2014, Bloem and Ling 2019), evidence for directly testing the validity of the normalization model in predicting human cortical responses in a quantitative way remains scarce. A few studies have demonstrated the quantitative advantage of normalization-based models compared to linear models in predicting human fMRI responses using gratings, noise patterns, and single objects (Kay et al. 2013a, Kay et al. 2013b), as well as moving checkerboards (Aqil et al. 2021, Foster and Ling 2021). However, whether normalization can also be used to predict cortical responses to multiple objects, and if and to what extent it can explain the modulations in response caused by attention to objects in the human brain remain unanswered.”
(6) Lines 23-25 – Related to the previous point. There is a recent study in PNAS which must be cited Aqil, Knapen and Dumoulin (PNAS 2021). Moreover, the authors should discuss how this manuscript goes beyond that study and discuss any supporting or conflicting conclusions. See also the Commentary by Foster and Ling (PNAS 2021).
Also, there is evidence of normalization in behavioral studies that seems worth mentioning; e.g., Herrmann et al. (Nature Neurosci 2010, JoV 2012); Schwedhelm et al. (PLoS Comput Biol 2016).
We have added citations of the study by Aqil et al., and the commentary by Foster and Ling, as well as how our study goes beyond that study in the Introduction section in lines 76-82:
“A few studies have demonstrated the quantitative advantage of normalization-based models compared to linear models in predicting human fMRI responses using gratings, noise patterns, and single objects (Kay et al. 2013a, Kay et al. 2013b), as well as moving checkerboards (Aqil et al. 2021, Foster and Ling 2021). However, whether normalization can also be used to predict cortical responses to multiple objects, and if and to what extent it can explain the modulations in response caused by attention to objects in the human brain remain unanswered.”
We have also added citations of behavioral studies that report evidence of normalization in the Introduction section in lines 69-72:
“In the human visual cortex, normalization has been speculated to underlie response modulations in the presence of attention, with evidence provided both by behavioral studies of space-based (Herrmann et al. 2010) and feature-based (Herrmann et al. 2012, Schwedhelm et al. 2016) attention as well as neuroimaging studies of feature-based attention (Bloem and Ling 2019).”
(7) The authors should discuss if there is neural and/or behavioral evidence of normalization in object-based attention (which would also provide some justification for the stimuli chosen, rather than what prior work of a similar vein has used i.e. contrast gratings). This is important as it is the manipulation in the present study, but most of the papers in the Introduction refer to spatial attention, and it should not be taken for granted that either the behavioral or the neural effects are the same for different types of attention or for different types of stimuli.
We agree with the reviewer that the effects of space-based attention cannot be readily generalized to object-based attention. As the reviewer has pointed out, this is precisely why we ran this study. We could not find evidence of normalization in object-based attention in the literature. However, there is evidence for normalization in feature-based attention. We have now separated the references related to feature- and space-based attention and have made our contribution related to object-based attention more explicit in the Introduction section in lines 56-68:
“Previous studies have demonstrated how the normalization computation accounts for these observed effects of attention in the monkey brain. They have suggested that normalization attenuates the neural response in proportion to the activity of the neighboring neuronal pool (Reynolds and Heeger 2009, Lee and Maunsell 2009, Boynton 2009, Ni et al. 2012). These studies have focused on space-based (Reynolds and Heeger 2009, Lee and Maunsell 2009, Ni et al. 2012) or feature-based (Ni et al. 2019) attention. While it has been suggested that these different forms of attention affect neural responses in similar ways, there exist distinctions in their reported effects, such as different time courses (Hayden and Gallant 2005), and the extent to which they affect different locations in the visual field (Serences and Boynton 2007, Womelsdorf et al. 2006), suggesting that there are common sources as well as differences in modulation mechanisms between these forms of attention (Ni and Maunsell 2019). This leaves open the question of whether normalization can explain the effects of object-based attention.”
(8) Line 33: Specify which models? The normalization and weighted average model?
We have included the name of the models in the sentence to avoid confusion in the Introduction section in lines 89-92:
“We also demonstrate that normalization is closer to averaging in the absence of attention, as previously reported by several studies, but that the results of the weighted average model and the normalization model diverge to a greater extent in the presence of attention.”
(9) Line 55-56. The authors refer to the figure to show how BOLD response changes across task conditions, but there is no interpretation of these data. How do we expect the raw data to look? Does this match the current data? If so why, if not, why not? In its current form, a figure of data is shown without explaining what it means. The authors could present examples of the BOLD signal elicited from the seven conditions; doing so would provide a deeper look at the data and how the fMRI signal changes with condition. Is there a pattern to the response?
We thank the reviewer for this suggestion. We agree that including the response in the seven conditions would be very helpful for readers. We have added a figure illustrating the average fMRI responses across voxels for all conditions and in all regions, see Figure 2.
We have also included our interpretation of the observed results in the Results section in lines 118-150:
“To examine the cortical response in different task conditions, we fit a general linear model and estimated the regression coefficients for each voxel in each condition. Figure 2a-e illustrates the average voxel coefficients for different conditions in the five regions of interest (ROIs), including V1, LO, pFs, EBA, and PPA. Each task condition was named based on the presented stimuli and the target of attention, with B and H denoting the presence of body and house stimuli, respectively, and the superscript “at” denoting the target of attention. Note that we have not included the responses related to the fixation block with no stimulus since this condition was only used to select the voxels that were responsive to the presented stimuli in each ROI (see Methods). We observed that the average voxel coefficients related to the four conditions in which attention was directed to the body or the house stimuli (the first four conditions, Bᵃᵗ, BᵃᵗH, BHᵃᵗ, Hᵃᵗ) were generally higher than the responses related to the last three conditions (B, H, and BH), in which the body and house stimuli were unattended (ts > 4, ps < 1.7⨯10⁻³, corrected). This is in agreement with previous research indicating that attention to objects increases their cortical responses (Roelfsema 1998, O’Craven 1999, Reddy 2009).
Looking more closely at the results in the regions EBA and PPA that have strong preferences for body and house stimuli, respectively, it seems that the effect of attention interacts with the regions’ preference. For instance, in the body-selective region EBA, the response to attended body stimuli in isolation is similar to the response to attended body stimuli paired with unattended house stimuli (compare Bat and BatH bars). On the other hand, the response to attended house stimuli in the isolated condition is significantly less than the response to attended house stimuli paired with unattended body stimuli. We can observe similar results in PPA, but not in V1 or the object-selective regions LO and pFs. But note that the latter three regions do not have strong preferences for one of the stimuli. Therefore, in order to examine the interaction between attention and preference more closely, we determined preferences at the voxel level in all ROIs.
We defined the preferred (P) and null (N) stimulus categories for each voxel in each ROI according to the voxel’s response to the isolated body and isolated house conditions. Figure 2f shows the percentage of voxels in each region that were selective to bodies and houses, averaged across participants. As illustrated in the figure, in the object-selective regions LO and pFs, almost half of the voxels were selective to each category, while in the EBA and PPA regions, the general preference of the region prevailed. Even though these regions were selected based on their preference, the noise in the fMRI data and other variations due to imperfect registration could lead to some voxels showing different preferences in the main session compared to the localizer session (Peelen and Downing 2005).”
In addition, we have now included the interpretation of the results illustrated in figure 3 (figure 2 in the previous version) in lines 160-173:
“We observed that the mean voxel response was generally higher when each stimulus was attended compared to the condition in which it was ignored. For instance, the response in the Pᵃᵗ condition (in which the isolated preferred stimulus was attended) was higher than in the P condition (where the isolated preferred stimulus was ignored) in LO, pFs, and PPA (ts > 3.6, ps < 0.01, corrected), marginally higher in EBA (t(18)=2.69, p=0.072, corrected), and not significantly higher in V1 (t(18)=2.52, p=0.1, corrected). Similarly, comparing the N and Nᵃᵗ conditions in each ROI, we observed an increase in response caused by attention in all ROIs (ts > 4, ps < 3.7⨯10⁻³, corrected) except for V1 (t(18)=2.4, p=0.13, corrected). A similar trend of response enhancement due to attention could also be observed in the paired conditions: attending to either stimulus increased the response in all ROIs (ts > 4.4, ps < 1.5⨯10⁻³, corrected) except for V1 (ts < 2.59, ps > 0.087, corrected). In all cases, the effect of attention was absent or only marginally significant in V1, which is not surprising since attentional effects are much weaker (McAdams and Maunsell 1999) or even absent (Luck et al. 1997) in V1 compared to the higher-level regions of the occipito-temporal cortex.”
(10) Line 86-89: The authors state, "As evident in the figure, the predictions of the normalization model (in orange) very closely follow the observed data (in navy), while the predictions of the weighted sum and weighted average models (light blue and gray, respectively) have significant deviations from the data". In which ROIs? All of them? It looks like the normalization model looks close to the data for PFs, EBA and LO. For V1 and PPA it is better than the other two but it is not as close. Is this what is shown in 2C?
Suggestion to order the panels in Figure 2 logically (the BOLD and model data all presented in A for all 5 ROIs) – there's no reason to just show LO1 as the main figure because it looks best, then the variance explained panel after – then separately the R2. Much of the paper comes across as 'normalization is better for all' and removes any nuance in differences in different areas; why might it be better in some than others? This is not detailed in the manuscript.
We thank the reviewer for this suggestion. We agree that a logical order of the panels is more appropriate. We have now ordered the panels more logically (V1, LO, pFs, EBA, PPA), all with the same size (see Figure 3).
To answer the reviewer’s question about the differences across areas, after we selected active voxels in each region for further analyses (this was done to answer comment 9 from the reviewers in the Methods and Data Analysis section of this letter), we calculated the normalization model’s r-squared difference from the noise ceiling. This measure was not significantly different across ROIs (repeated measures ANOVA, F(4,72)=0.58, p = 0.61). Therefore, the fit of the normalization model is not significantly worse in any of the ROIs. However, since the model’s predictions are not similarly close to the data across all conditions, we have toned down the point about the superiority of the normalization model in the Results section, lines 218-220:
“As evident in the figure, the predictions of the normalization model (in orange) are generally better than the predictions of the weighted sum and the weighted average models (light blue and gray, respectively) in all regions.”
We have also included the results of the ANOVA test comparing the difference between the noise ceiling and the normalization fit across ROIs, showing that there is no significant difference in this measure across ROIs in the Results section in lines 240-246:
“We then calculated the normalization model’s r-squared difference from the noise ceiling (NRD) for each ROI. NRD is a measure of the model’s ability to account for the explainable variation in the data; the lower the difference between the noise ceiling and a model’s goodness of fit, the more successful that model is in predicting the data. We ran a one-way ANOVA to test for the effect of ROI on NRD and observed that this measure was not significantly different across ROIs (F(4,72)=0.58, p = 0.61), demonstrating that the normalization model was equally successful across ROIs in predicting the explainable variation in the data.”
We have also included the details of model comparison, showing which model performed better in each region, in the Results section in lines 229-239:
“We first compared the goodness of fit of the three models across the five ROIs using a 3⨯5 repeated measures ANOVA. The results showed a significant main effect of model (F(2,36) = 72.9, p = 9.86⨯10⁻¹¹) and ROI (F(4,72) = 26.66, p = 1.04⨯10⁻⁷), and a significant model by ROI interaction (F(8,144) = 24.96, p = 3.74⨯10⁻¹⁵). On closer inspection, the normalization model was a better fit to the data than both the weighted sum (ps < 0.019, corrected) and the weighted average (ps < 5.7⨯10⁻⁵, corrected) models in all ROIs. Since the normalization model had more parameters, we also used the AIC measure to correct for the difference in the number of parameters. The normalization model was a better fit according to the AIC measure as well (see supplementary file 2).
It is noteworthy that while the weighted average model performed better than the weighted sum model in LO and EBA (ps < 0.0016, corrected), it was not significantly better in pFs and PPA (ps > 0.37, corrected), and was worse than the weighted sum model in V1 (p = 6.5⨯10⁻⁶, corrected).”
We have discussed the difference in the fit of the weighted average and weighted sum models across regions in the Discussion section in lines 422-430:
“Our results demonstrate that while the weighted average model generally performs better than the weighted sum model in the higher-level occipito-temporal cortex, the weighted sum model provides better predictions in V1. These results suggest stronger sublinearity in higher regions of the visual cortex compared to V1, which is in agreement with previous reports (Kay et al. 2013b). This observation might be related to the higher sensitivity of V1 neurons to contrast (Goodyear and Menon, 1998), causing a more significant decrease in V1 responses to low-contrast stimuli. This, in turn, might make the low-contrast stimulus weaker for V1 neurons, causing a move towards a lower level of sublinearity (Sceniak et al. 1999).”
(11) Could the asymmetry effect found in the BOLD response be driven potentially by nonlinearities in the BOLD response, rather than being neural in origin? Do the authors have ways to mitigate this concern? One thought would be to show the relation (or lack thereof) between voxels that have a high BOLD response, and their degree of asymmetry. If the effect is not due to saturation, one would expect no significant relation.
The reviewer raises a very good point.
To examine whether the observed asymmetry effect was driven by BOLD response saturation, we tested a saturation model (the weighted average UWUB saturation model, detailed in the Methods section). This model’s goodness of fit, as well as its predictions of response change and asymmetry, are illustrated in figure 5b and c in the paper.
As illustrated in figure 5c, the saturation model’s predicted asymmetry was closer to the data than normalization’s prediction only in EBA (p = 0.043, corrected). In other regions, the normalization model’s prediction of asymmetry was either significantly closer to the data (in V1, LO, and pFs, ps < 1.02⨯10⁻⁶, corrected) or not significantly different from the saturation model (in PPA, p = 0.63, corrected).
We then compared the overall fits of the two models by running a 2⨯5 ANOVA across ROIs. We observed a significant effect of model (F(1,18) = 91.16, p = 1.8⨯10⁻⁸), a significant effect of ROI (F(4,72) = 19.46, p = 2.21⨯10⁻⁷), and a significant model by ROI interaction (F(4,72) = 4.82, p = 0.0061). Post-hoc t-tests showed that the normalization model was a significantly better fit than the saturation model in all ROIs (ps < 0.008, corrected).
Note that saturation is a characteristic of cortical responses even at the neural level. Our results are in agreement with previous monkey electrophysiology studies, which showed an asymmetry in attentional modulation for attention to the preferred versus the null stimulus and demonstrated that the normalization model can explain this effect (Ni and Maunsell, 2012). Since this asymmetry has been reported at the neural level, it is unlikely to be related only to nonlinearities in the BOLD signal.
We have added these results in the Results section in lines 345-357:
“Next, to explore whether the observed asymmetry was caused by response saturation, we tested a nonlinear variant of the weighted average model with saturation (the weighted average UWUB saturation model). This model’s goodness of fit, as well as its predictions of response change and asymmetry, are illustrated in figure 5b,c. As illustrated in the figure, the saturation model’s predicted asymmetry was closer to the data than normalization’s prediction only in EBA (p = 0.043, corrected). In other regions, the normalization model’s prediction of asymmetry was either significantly closer to the data (in V1, LO, and pFs, ps < 1.02⨯10⁻⁶, corrected) or not significantly different from the saturation model (in PPA, p = 0.63, corrected).
After running a 2⨯5 ANOVA to compare the fits of the normalization model and the weighted average UWUB saturation model across ROIs, we observed a significant effect of model (F(1,18) = 91.16, p = 1.8⨯10⁻⁸), a significant effect of ROI (F(4,72) = 19.46, p = 2.21⨯10⁻⁷), and a significant model by ROI interaction (F(4,72) = 4.82, p = 0.0061). Post-hoc t-tests showed that the normalization model was a significantly better fit than the saturation model in all ROIs (ps < 0.008, corrected).”
We have also included discussions on asymmetry in the Discussion section in lines 468-479:
“Another limitation in interpreting the results is related to a possible stronger saturation of the BOLD response, which can potentially explain the observed asymmetry in attentional modulation. Since the asymmetry in attentional modulation has also been previously reported at the neural level (Ni and Maunsell, 2012), this effect is unlikely to be exclusively caused by the nonlinearities in the BOLD signal. It is noteworthy, however, that saturation is a characteristic of cortical responses even at the neural level. Whether this effect at the neural level is caused by response saturation or as a result of the normalization computation cannot be distinguished from our current knowledge. Nevertheless, we tested a variant of the weighted average model with an extra saturation parameter. Although this model could partially predict the observed asymmetry, the predictions were worse than the normalization model’s predictions. Also, this model was an overall worse fit to the data compared to the normalization model. The normalization model, therefore, provides a more parsimonious account of the data.”
(12) The changes in response with attention and the asymmetry effects are predictions of the biased competition model of attention and reference should be made to this literature (e.g., papers by Desimone and Duncan). In addition, this is not the first time that an asymmetry effect is observed in the fMRI literature (line 131). This effect is a direct prediction of the biased competition theory of attention, and has been previously reported in the fMRI literature as well (e.g., in reference 12).
The biased competition model predicts the change in response with attention (reduction in response when shifting attention from preferred to null stimuli), and we have added references to the related work in the Discussion section. However, our modeling results showed that the predictions of the weighted average model (as a quantitative derivation of the biased competition model) were significantly lower for the amount of change in response by attention than what was observed in the data. On the other hand, this response change was closely predicted by the normalization model.
We have included reference to the biased competition model in the Discussion section in lines 377-382:
“Although this response reduction has also been predicted by the biased competition model (Desimone and Duncan 1995), our results showed that the predictions of the weighted average model were significantly lower than the response reduction observed in the data in all ROIs except in V1. In contrast, the normalization model predicted the response reduction more accurately in all regions except in V1, where no significant response reduction was observed.”
Regarding the asymmetry effect, we are unaware of any previous studies showing or predicting the asymmetry effect in the framework of the biased competition model. In fact, Reddy et al. (2009) used a weighted average model and reported a bias of 30% when attention was directed to one stimulus compared to when it was divided between the two stimuli. They did not sort conditions based on voxel preference for the presented stimuli and reported an overall 30% weight shift with attention, with no significant main effect of category, which suggests no asymmetry.
According to our modeling results, the weighted average variant similar to the one used by Reddy et al. (the “weighted average UW” model in the Other variants of the weighted average model section) did not predict such an asymmetry. A nonlinear variant of the weighted average model in which we added a saturation parameter to the linear weighted average version could predict this effect, but the predictions were not as close to the data as the normalization model.
(13) Line 178-181 – The limitation mentioned here is very vague. Some discussion of the limitations should go beyond 'we are aware there are limitations and hope one day they can be overcome'. What do these limitations mean for the data analysis and the interpretation of results? What could alternative explanations be? How will they be overcome?
We have added the following sentences to the Discussion section in lines 455-479 to elaborate on the limitations of our study:
“It is noteworthy that here, we are looking at the BOLD responses. We are aware of the limitations of the fMRI technique as the BOLD response is an indirect measure of the activity of neural populations. While an increase in the BOLD signal could be related to an increase in the neuronal firing rates of the local population (Logothetis et al. 2001), it could also be related to subthreshold activity resulting from feedback from the downstream regions of the visual cortex (Heeger and Ress 2002). The observed effects, therefore, may be related to local population responses or may be influenced by feedback from downstream regions. Also, since the measured BOLD signal is related to the average activity of the local population, and we do not have access to single unit responses, some effects may change in the averaging process. Nevertheless, our simulation results show that the effects of the normalization computation are preserved even after averaging. We should keep in mind, though, that these are only simulations and are not based on data directly measured from neurons. Future experiments with intracranial recordings from neurons in the human visual cortex would be invaluable in validating our results.
Another limitation in interpreting the results is related to a possible stronger saturation of the BOLD response, which can potentially explain the observed asymmetry in attentional modulation. Since the asymmetry in attentional modulation has also been previously reported at the neural level (Ni and Maunsell, 2012), this effect is unlikely to be exclusively caused by the nonlinearities in the BOLD signal. It is noteworthy, however, that saturation is a characteristic of cortical responses even at the neural level. Whether this effect at the neural level is caused by response saturation or as a result of the normalization computation cannot be distinguished from our current knowledge. Nevertheless, we tested a variant of the weighted average model with an extra saturation parameter. Although this model could partially predict the observed asymmetry, the predictions were worse than the normalization model’s predictions. Also, this model was an overall worse fit to the data compared to the normalization model. The normalization model, therefore, provides a more parsimonious account of the data.”
Model
(1) A main concern regarding the interpretation of these results has to do with the sparseness of data available to fit with the models. The authors pit two linear models against a nonlinear (normalization) model. The predictions for weighted average and summed models are both linear models doomed to poorly match the fMRI data, particularly in contrast to the nonlinear model. So, while the verification that responses to multiple stimuli don't add up or average each other is appreciated, the model comparisons seem less interesting in this light. The model testing endeavor seems rather unconstrained. A 'true' test of the model would likely need a whole range of contrasts tested for one (or both) of the stimuli. Otherwise, as it stands we simply have a parameter (σ) that instantly gives more wiggle room than the other models. It would be fairer to pit this normalization model against other nonlinear models. Indeed, this has already been done in previous work by Kendrick Kay, Jon Winawer and Serge Dumoulin's groups. The same issue of course extends to the attended conditions.
We thank the reviewer for this comment. Regarding the reviewer’s concern about comparing a nonlinear model with a linear one, we have tried to answer with two different approaches. First, as suggested by the reviewer, we used a nonlinear variant of the weighted average model with an extra saturation parameter. This model had distinct free parameters for the weights of the preferred and null stimuli and distinct attention parameters for the preferred and null stimuli. It, therefore, had five free parameters. The comparison between the results of this model with the normalization model’s predictions showed that the predictions of this model were significantly worse than normalization in all regions (ps < 0.008, corrected).
We have added these results in the Results section in lines 352-357:
“After running a 2⨯5 ANOVA to compare the fits of the normalization model and the weighted average UWUB saturation model across ROIs, we observed a significant effect of model (F(1,18) = 91.16, p = 1.8⨯10⁻⁸), a significant effect of ROI (F(4,72) = 19.46, p = 2.21⨯10⁻⁷), and a significant model by ROI interaction (F(4,72) = 4.82, p = 0.0061). Post-hoc t-tests showed that the normalization model was a significantly better fit than the saturation model in all ROIs (ps < 0.008, corrected).”
We have also included a figure illustrating the comparison between the normalization model and the saturation model in the paper (see the figure in the response to comment 11 of Theoretical framework and interpretation section of this letter).
The second approach we took was the simulation of responses of a neural population. We simulated three neural populations: one in which the neural response to multiple stimuli and attended stimuli was calculated based on the weighted sum model (summing neurons), one in which neural responses were calculated based on the weighted average model (averaging neurons), and one in which neural responses were calculated based on divisive normalization (normalizing neurons). Then, by randomly selecting neurons from the population to make up voxels, we estimated model parameters for each voxel using each of the three models used in our study. We then predicted the response of the left-out half of the simulated data using the parameters estimated for the first half. The results demonstrate that the normalization model is only a better fit when predicting the response of a normalizing population (figure 3—figure supplement 2). It performs worse than the weighted sum model for summing neurons and worse than the weighted average model for averaging neurons. Therefore, a nonlinear model is not necessarily a better fit for a linear operation.
We have included the simulation results in the Results section in lines 258-267:
“To ensure that the superiority of the normalization model over the weighted sum and weighted average models was not caused by the normalization model's nonlinearity or its higher number of parameters, we ran simulations of three neural populations. Neurons in each population computed responses to multiple stimuli and attended stimuli according to a summing, an averaging, or a normalizing rule (see Methods). We then used the three models to predict the population responses. Our simulation results demonstrate that despite its higher number of parameters, the normalization model is a better fit only for the population of normalizing neurons and not for summing or averaging neurons, as illustrated in figure 3—figure supplement 2. These results confirm that the better fits of the normalization model cannot be attributed to the model's nonlinearity or its higher number of parameters.”
We have also included the simulation details in the Simulation section in the Methods section.
In addition to these two approaches, and alongside the previously used cross-validation, we have also included the AIC measure for model comparison to correct for the difference in the number of parameters (Results section, lines 234-236; see the response to the next comment):
“Since the normalization model had more parameters, we also used the AIC measure to correct for the difference in the number of parameters. The normalization model was a better fit according to the AIC measure as well (see supplementary file 2).”
Therefore, we believe that our simulation approach and the nonlinear model we tested provide enough evidence for our conclusion that the success of the normalization model is not due to its higher number of parameters, but rather a result of it being a closer approximation of the computation performed at the neural level for object representation. Using AIC to compensate for the difference in the number of parameters confirms our cross-validated r-squared comparisons. Given the robustness of the agreement in the results of these analyses, we do not believe nonlinearity or the number of parameters can explain the superiority of the normalization model.
We have discussed these approaches in the Discussion section in lines 442-454:
“Here, we compared the nonlinear normalization model with two linear models with fewer free parameters. To ensure that the difference in the number of free parameters did not affect the results, we used cross-validation and the AIC measure to compare model predictions with the data. If the success of the normalization model were merely due to its higher number of free parameters, its predictions for left-out data would suffer. We observed, however, that the normalization model was also successful in predicting the left-out part of the data. In addition, we tested a nonlinear model variant with five free parameters. This model was still a worse fit than the normalization model. Finally, we used simulations of three different neural populations, with neurons in each population following either a summing, averaging, or normalization rule in their response to multiple stimuli and attended stimuli. Simulation results demonstrated that the normalization model was a better fit only for the normalizing population, confirming that the success of the normalization model is not due to its nonlinearity or its higher number of parameters but rather a result of it being a closer approximation of the computation performed at the neural level for object representation.”
Regarding the reviewer’s suggestion of testing a range of contrasts: as we have now mentioned in the Discussion section, here we explored the effects of attention and whether the normalization model could explain the effects of object-based attention in the human visual cortex, which has not been previously studied. While we acknowledge the importance of studying the effects of contrast, the presentation of superimposed stimuli limits the range of contrasts we can use, because high-contrast stimuli occlude each other and stimuli with very low contrasts are very difficult to recognize. In addition, including variations of both attentional state and contrast would have required a multi-session experiment, which was not feasible given the limitations caused by the COVID pandemic.
We have added this in the Discussion section in lines 431-441:
“Attention to a stimulus has been suggested in the literature to be similar to an increase in the contrast of the attended stimulus (Ni and Maunsell 2012), which is manifested in the similar effects of attention and contrast in the normalization equation. In this study, we presented the stimuli with a constant contrast but changed the number of stimuli and their attentional state to determine whether the normalization model could explain the effects of object-based attention in the human visual cortex, which has not been previously studied. Although we acknowledge that testing for a range of contrasts would help in exploring the interaction of contrast and attention and in generalizing the conclusions of this study, it would not be trivial for the current design because the presentation of superimposed stimuli limits the range of contrasts that can be tested. Besides, including variations of both attentional state and contrast would need a multi-session fMRI experiment that would limit the feasibility of the study.”
(2) The difference in the number of free parameters biases the results. The normalization model has more free parameters than the other two models. The authors use a split-half approach to avoid the problem of overfitting, but a model with more free parameters in and of itself potentially has a better capacity to fit the data compared to the other models. This built-in bias could account for the results. Model comparison (like with the AIC measure) is necessary to account for this difference.
As our simulation results show (see the response to the previous comment), a nonlinear model does not necessarily fit the data better if the underlying population response follows a linear summation. Our cross-validation approach ensures that the number of parameters would not benefit us in fitting the left-out data (Kay et al. 2013b). Nevertheless, in addition to the predicted r-squared on the left-out data, we have now used the Akaike Information Criterion (AIC) to compare models. The results are in line with our cross-validation approach. This is not surprising given the previous literature showing that the cross-validation approach leads to similar results as the AIC approach (Fang 2011, Kay et al. 2013b). The normalization model had a smaller AIC value than both the weighted sum and the weighted average models in all ROIs.
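For least-squares fits, the AIC can be computed directly from the residual sum of squares, penalizing each extra free parameter. A sketch with hypothetical numbers (not our actual fit values):

```python
import numpy as np

def aic_least_squares(rss, n, k):
    # AIC for a Gaussian least-squares fit: n * ln(RSS / n) + 2k,
    # where n is the number of data points and k the number of free parameters
    return n * np.log(rss / n) + 2 * k

# Hypothetical fits on the same n data points: a 4-parameter model with
# smaller residuals vs. a 2-parameter model with larger residuals
n = 100
aic_norm = aic_least_squares(rss=20.0, n=n, k=4)
aic_avg = aic_least_squares(rss=35.0, n=n, k=2)
delta_aic = aic_norm - aic_avg  # negative: the larger model wins despite the penalty
```

The model with the smaller AIC is preferred; the 2k term is what corrects for the difference in parameter count.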
We have included the AIC comparison results between normalization and the two other models in the Results section in lines 232-236:
“On closer inspection, the normalization model was a better fit to the data than both the weighted sum (ps < 0.019, corrected) and the weighted average (ps < 5.7⨯10⁻⁵, corrected) models in all ROIs. Since the normalization model had more parameters, we also used the AIC measure to correct for the difference in the number of parameters. The normalization model was a better fit according to the AIC measure as well (see supplementary file 2).”
We have included the ΔAIC values for all regions in supplementary file 2.
(3) Related to the above point, the response to the Pref (P) and Null (N) stimuli are also free parameters in the normalization model and it's not clear why the values of Rp and Rn are not used here as for the other two models? These differences could again account for the better fit observed for the normalization model. Further, what are the values of Cp and Cn in the normalization model? How are they determined?
The normalization model is different in nature from the two linear models and takes into account the suppression caused by the neighboring pool of neurons, even in the presence of a single stimulus. We cannot, therefore, use the measured response in isolated conditions as the excitatory drive in paired conditions. Rather, we need extra parameters to estimate the excitation caused by each stimulus and then use this excitation to predict the response to attended and ignored stimuli in isolated and paired conditions (Ni et al. 2012, Ni and Maunsell 2017, Ni and Maunsell 2019). On the other hand, the weighted sum and the weighted average models are not concerned with the underlying excitation and suppression. The assumption of these models is based on the resulting response that we actually measure in the paired condition, considering it to be respectively the sum or the average of the measured response in the isolated conditions. In order to take the difference in the number of model parameters into account, we have used cross-validated r-squared and AIC measures. We also tested a model with five parameters (response to point 1 of the Model section of this letter) which was a worse fit than normalization. Moreover, our simulation results (explained in point 1 as well) also showed that a model with a higher number of parameters is not a better fit when the underlying neural computation is different from the model’s computation. Given the robustness of the agreement in the results of these analyses, we do not believe the number of parameters can explain the superiority of the normalization model.
We have added an explanation of the need for the LP and LN parameters in the Methods section in lines 620-631:
“It is noteworthy that the normalization model is different in nature from the two linear models and takes into account the suppression caused by the neighboring pool even in the presence of a single stimulus. We cannot, therefore, use the measured response in isolated conditions as the excitatory drive in paired conditions. Rather, we need extra parameters to estimate the excitation caused by each stimulus. We then use this excitation to predict the response to attended and ignored stimuli in isolated and paired conditions (Ni et al. 2012, Ni and Maunsell 2017, Ni and Maunsell 2019). On the other hand, the weighted sum and the weighted average models are not concerned with the underlying excitation and suppression. The assumption of these models is based on the resulting response that we actually measure in the paired condition, considering it to be respectively the sum or the average of the measured response in the isolated conditions. In order to take the difference in the number of model parameters into account, we have used both cross-validated r-squared on independent data and AIC measures (see the section on model comparison).”
As for the contrast values, we set them to one when the stimulus was presented and to zero when the stimulus was not presented, as explained in the Methods section in lines 615-618. We have now added a sentence for a more explicit explanation:
“cP and cN are the respective contrasts of the preferred and null stimuli. Zero contrast for a stimulus denotes that the stimulus is not present in the visual field. In our experiment, we set contrast values to one when a stimulus was presented and to zero when the stimulus was not presented.”
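To make the contrast convention above concrete, the following sketch shows how a zero contrast removes a stimulus from both the excitatory drive and the normalization pool (the drive and semi-saturation values are hypothetical, not our fitted parameters):

```python
def norm_response(c_p, c_n, L_p=1.0, L_n=0.2, sigma=0.5):
    # Divisive normalization: contrast-weighted excitatory drives divided by
    # the contrast-weighted pool plus the semi-saturation constant sigma
    return (c_p * L_p + c_n * L_n) / (c_p + c_n + sigma)

r_pref_alone = norm_response(c_p=1, c_n=0)  # isolated preferred stimulus
r_paired = norm_response(c_p=1, c_n=1)      # both stimuli present
# Adding the null stimulus suppresses the response via the pool,
# so r_paired falls below r_pref_alone
```

With c_N = 0, the null stimulus drops out of both numerator and denominator, so the isolated-condition prediction follows from the same equation as the paired condition.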
(4) The weighted average model is confusing. Equation 5a is a strict average model. The weighted average model (Desimone and Duncan) however proposes that the response to combined stimuli is a weighted average of the response to the individual stimuli (i.e., with possibly different weights for each stimulus). Second, attention biases these weight values by possibly different amounts for different stimuli/in different brain regions. In this study, the response to the combined stimuli is modeled as a strict average (with a fixed weight of 0.5), and further, one fixed weight (β) is assigned to P and N for attention effects. However, in the weighted average model, the effect of attention could potentially be different for the preferred and null stimuli (i.e., a different β for P and another β for N), which might generate a better fit for the weighted average model. Indeed, although it is difficult to read off the graph, attention appears to enhance the response to the N stimuli more than to the P stimuli. How might the results be affected if the weights in the PN condition are not fixed to 0.5, and if attention is allowed to differently affect the two types of stimuli? A similar argument could also be made for the weighted sum model.
As you correctly specified, the weighted average model we used is a strict average model, weighted by the attentional bias. We have now added the results for other weighted average variants with different weights for the two stimuli in the absence of attention, and with different attention factors for the preferred and null stimuli.
First, to explore how unequal weights affect the fit of the weighted average model, we tested a weighted average variant with unequal weights (weighted average UW model). Comparison of the fit of this model with the weighted average model with equal weights (weighted average EW) showed that the UW variant was a significantly better fit than the EW model in all regions (ts > 3.9, ps < 4.7⨯10⁻³, corrected) except in LO, where it was a marginally better fit (t(18) = 2.83, p = 0.054, corrected).
In the next step, to examine the effect of unequal weights and attention parameters on the fit of the weighted sum and the weighted average models, we tested a weighted average model variant with unequal weights and unequal β parameters for the P and N stimuli (weighted average UWUB). In this variant, no constraint was put on the sum of the weights. Thus, this model is effectively a generalization of the weighted sum and the weighted average models with four parameters. This model was a better fit than the weighted average EW model in all regions (ts > 3.78, ps < 0.007, corrected) except in EBA (t(18) = 2.65, p = 0.08, corrected).
We also ran a 4⨯5 ANOVA to compare all weighted average variants with the normalization model. There was a significant effect of model (F(3,54) = 89.75, p = 6.05⨯10⁻¹⁶), a significant effect of ROI (F(4,72) = 34.97, p = 2.29⨯10⁻¹⁰), and a significant model by ROI interaction (F(12,216) = 7.55, p = 2.5⨯10⁻⁵). Post-hoc t-tests showed that these weighted average variants were still significantly worse fits to the data than the normalization model in all regions (ps < 7.84⨯10⁻⁴, corrected) except for EBA, where the normalization model was marginally better than the weighted average UWUB (p = 0.065, corrected).
We have included the results of these model comparisons in the Results section in lines 320-344, and in figure 5a (see the figure in the response to comment 11 of Theoretical framework and interpretation section of this letter):
“The weighted average model we used in previous sections had equal weights for the preferred and null stimuli, with attention biasing the attended preferred or null stimulus by the same amount (the weighted average EW model). However, different stimuli might have different weights in the paired response depending on the neurons' preference towards the stimuli. Besides, attention may bias preferred and null stimuli differently. Therefore, to examine the effect of unequal weights and attention parameters on the fits of the weighted average model, we tested two additional variants of this model.
To examine how unequal weights affect the fit of the weighted average model, we tested the weighted average UW model. Comparison of the fit of this model with the weighted average EW model showed that the UW variant was a significantly better fit than the EW model in all regions (ts > 3.9, ps < 4.7⨯10⁻³, corrected) except in LO, where it was a marginally better fit (t(18) = 2.83, p = 0.054, corrected).
In the next step, to examine the effect of unequal weights and attention parameters on the fit of the weighted sum and the weighted average models, we tested the weighted average UWUB variant. This model had unequal weights and unequal attention parameters for the P and N stimuli. In this variant, no constraint was put on the sum of the weights. Thus, this model was effectively a generalization of the weighted sum and the weighted average models with four parameters. This model was a better fit than the weighted average EW model in all regions (ts > 3.78, ps < 0.007, corrected) except in EBA (t(18) = 2.65, p = 0.08, corrected).
We next compared the goodness of fit of all weighted average variants with the normalization model using a 4⨯5 ANOVA, as illustrated in figure 5a. There was a significant effect of model (F(3,54) = 89.75, p = 6.05⨯10⁻¹⁶), a significant effect of ROI (F(4,72) = 34.97, p = 2.29⨯10⁻¹⁰), and a significant model by ROI interaction (F(12,216) = 7.55, p = 2.5⨯10⁻⁵). Post-hoc t-tests showed that these weighted average variants were still significantly worse fits to the data than the normalization model in all regions (ps < 7.84⨯10⁻⁴, corrected) except for EBA, where the normalization model was marginally better than the weighted average UWUB (p = 0.065, corrected).”
We have provided the details of these weighted average variants in the Methods section.
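The three weighted average variants described above can be summarized in code. This is a sketch for the paired condition only; the parameter names are ours, and r_p and r_n stand for the isolated responses:

```python
def wavg_ew(r_p, r_n, beta, attend="P"):
    # EW: equal weights, one attention parameter applied to the attended stimulus
    b_p = beta if attend == "P" else 1.0
    b_n = beta if attend == "N" else 1.0
    return (b_p * r_p + b_n * r_n) / 2

def wavg_uw(r_p, r_n, beta, w_p, attend="P"):
    # UW: unequal weights constrained to sum to one
    b_p = beta if attend == "P" else 1.0
    b_n = beta if attend == "N" else 1.0
    return b_p * w_p * r_p + b_n * (1 - w_p) * r_n

def wavg_uwub(r_p, r_n, w_p, w_n, b_p, b_n, attend="P"):
    # UWUB: unequal, unconstrained weights and separate attention parameters —
    # four parameters generalizing both the weighted sum and weighted average
    bp = b_p if attend == "P" else 1.0
    bn = b_n if attend == "N" else 1.0
    return bp * w_p * r_p + bn * w_n * r_n
```

With unit weights and no attentional bias, UWUB reduces to the weighted sum; with weights of 0.5, it reduces to the strict average, which is why it nests both linear models.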
(5) Line 57-61: What is the justification of using the weighted sum model when we know that this model is not reflective of what is going on in the brain? It sounds like it has been set up to fail. Has the weighted sum model been shown to predict responses in prior work? It is never really brought up again in the discussion, so its purpose is unclear. Wouldn't a comparison of the weighted average and normalization model make more sense? Better yet, consider comparisons with other model(s) beyond normalization that could explain the data.
Weighted sum and weighted average are two examples of linear summation. It has been previously suggested (Reddy et al., 2009) that the response to multiple stimuli lies somewhere between the predictions of these two models, closer to the weighted average. Moreover, Heuer and Britten (2002) and Rubin et al. (2015) have shown that for weak, low-contrast stimuli, the response to multiple stimuli approaches linearity and can even become supralinear. Since our stimuli were not at full contrast, we tested the weighted sum model as well.
In fact, based on our results, the weighted sum model performs better than the weighted average model in V1, and is not significantly different from the weighted average model in pFs and PPA. Therefore, for the stimuli we used, the weighted sum model is comparable to the weighted average model in some cases.
We have added this point in the Results section in lines 199-206:
“Although many studies have demonstrated that responses to multiple stimuli are added sublinearly in the visual cortex (Heeger 1992, Reddy et al. 2009, Bloem and Ling 2019, Aqil et al. 2021), it has been suggested that for weak stimuli, response summation can approach a linear or even a supralinear regime (Heuer and Britten 2002, Rubin et al. 2015).
Since the stimuli we used in this experiment were presented in a semi-transparent form and were therefore not in full contrast, we found it probable that the response might be closer to a linear summation regime in some cases. We therefore used the weighted sum model to examine whether the response approaches linear summation in any region.”
We agree that we need to discuss weighted sum in the discussion as well. We have included why it was necessary to check the weighted sum model in the Discussion section in lines 416-430:
“Stimulus contrast has also been shown to have a crucial role in how single-stimulus responses are added to obtain the multiple-stimulus response. While responses to strong high-contrast stimuli are added sublinearly to yield the multiple-stimulus response, as predicted by the normalization model and the weighted average model, the sublinearity decreases for lower contrasts and even changes to linearity and supralinearity for weak stimuli (Heuer and Britten 2002, Rubin et al. 2015). Here, since the stimuli we used were not in full contrast, we tested the weighted sum model as well to examine whether responses approach linearity in any region. Our results demonstrate that while the weighted average model generally performs better than the weighted sum model in the higher-level occipito-temporal cortex, the weighted sum model provides better predictions in V1. These results suggest stronger sublinearity in higher regions of the visual cortex compared to V1, which is in agreement with previous reports (Kay et al. 2013b). This observation might be related to the higher sensitivity of V1 neurons to contrast (Goodyear and Menon, 1998), causing a more significant decrease in V1 responses to low-contrast stimuli. This, in turn, might make the low-contrast stimulus weaker for V1 neurons, causing a move towards a lower level of sublinearity (Sceniak et al. 1999).”
Regarding comparisons with other models beyond normalization, we have included a linear model with no constraint on the sum of weights, as a generalization of the weighted sum and the weighted average models, along with its nonlinear variant with an extra saturation parameter (as explained in response to the previous point). We have included the results of all these models in the Results section, and in figure 5 (see the figure in the response to comment 11 of the Theoretical framework and interpretation section of this letter).
Methods and data analysis
(1) FIGURE 1 – protocol – Is task difficulty equated across house category, body category, and color change at fixation?
As now shown in figure 1—figure supplement 1, task difficulty (reaction time and accuracy) for single-stimulus conditions was matched across house, body, and fixation color blocks. For paired-stimulus conditions, accuracy was the same for the BatH and BHat conditions, but it was higher in the fixation color change block than in the BHat condition. However, since accuracies were very high for all participants and across all blocks, this difference is unlikely to have affected our results.
The following information has been added to the Results section in lines 111-117:
“Overall, average accuracy was higher than 86% in all conditions. Averaged across participants, accuracy was 94%, 89%, 86%, 93%, 94%, 96%, 95% and 96% for Bat, BatH, BHat, Hat, B, H, and BH conditions and the fixation block with no stimulus, respectively. A one-way ANOVA test across conditions showed a significant effect of condition on accuracy (F(7,126) = 8.24, p = 1.63⨯10⁻⁵) and reaction time (F(7,126) = 22.57, p = 4.52⨯10⁻¹²). As expected, post-hoc t-tests showed that this was due to lower performance in the BatH and BHat conditions (see figure 1—figure supplement 1). There was no significant difference in performance between all other conditions (ps > 0.07, corrected).”
(2) Did participants complete any practice trials for the one back task, or just went right into the scanner? Was the accuracy of the participants in the one back task recorded to ensure that they were paying attention and completing the task properly? Same for the fixation task.
Yes, all participants completed one practice run before data collection. If the average accuracy was lower than 75%, they completed another practice run. We only proceeded with data collection if their performance improved to a level higher than 75%, otherwise, no data was recorded. All participants passed this test. As mentioned in response to the previous comment, the actual accuracies inside the scanner were above 86%. We have now reported average accuracy in the Results section. We have also included a figure illustrating the accuracy and reaction time for each condition in figure 1—figure supplement 1.
(3) Most of page 6 is a repetition of the methods (much copy and paste), not really needed, instead a short summary would be better or just move the methods to after the introduction.
We have summarized this section in the Results section, removing all equations except for one for each model as an example to reduce redundancy.
(4) Some of the details regarding the actual design of the fMRI experiment seem glossed over. From what I can gather, each block was only a few seconds, with no rest between blocks? If so, the BOLD response would be exceptionally driven to saturating levels throughout the experiment. That said, it is difficult to unpack the details based on the Methods section as vague detail is provided.
We apologize if the details were not clear enough. Each block started with a 1-s cue and a 1-s fixation period, followed by 8 s of stimulus presentation. There was an 8-s fixation period between blocks. In addition, each run started with an 8-s fixation period, and there was a final 8-s fixation period after the last block of the run. We have added the missing information about the design of the experiment and have modified the design description so that the experimental details are clear to the reader.
(5) Lines 47-51. It is unclear how preferred (P) or null (N) stimulus categories were assigned to each voxel. Does a voxel preferring houses respond greater to houses than a body? Was there a statistical measure for determining what was the P or N stimulus for a given voxel? We imagine that almost all voxels in EBA significantly preferred bodies, and almost all voxels in PPA preferred houses. If so, how is this accounted for? Was there a reliable, statistically significant difference in house or body preference for voxels in the other ROIs, or is the difference between response activation to the two categories marginal? How many voxels preferred houses vs. bodies in each ROI? For each voxel, how reliable is the attribution of the P or N stimulus (e.g., did you use a cross-validation approach and determine P/N on some runs and tested the reliability on the remaining runs)?
We assigned P and N categories to voxels based on their responses in the isolated house and body conditions. Therefore, for a voxel with a higher response to an isolated house than to an isolated body, house was assigned as the preferred category and body as the null category. No statistical measure was used in assigning preferred and null categories to voxels. We have added a new figure to the paper illustrating the percentage of voxels in each ROI that preferred houses and bodies (figure 2f). As you have pointed out, the figure shows that most voxels in EBA and PPA preferred bodies and houses, respectively, over the other category. It is noteworthy that not all voxels in EBA are body-selective, nor are all voxels in PPA house-selective, which is due to the variability in fMRI data.
In other ROIs, the average percentages of voxel preference for bodies and houses were closer to each other. As you have suggested, we also checked voxel preference consistency in each ROI across the two halves of the data. The results demonstrate high preference consistency in all regions. We have added the result to figure 2-supplementary file 1.
(6) Were the P and N categories determined on all the data (a risk of double-dipping) or only on one half of the dataset?
P and N categories for each voxel were determined in one half of the data, and the model predictions and r-squared calculations were performed on the other half of the data. In the figures, this procedure has been done twice, once for each half of the data, and the results for the two halves were then averaged and illustrated.
(7) Provide more details of the GLM. What were the regressors?
The information is now added in the Methods section in lines 593-598:
“We performed a general linear model (GLM) analysis for each participant to estimate voxel-wise regression coefficients for each of the 8 task conditions. The onset and duration of each block were convolved with a hemodynamic response function and entered into the GLM as regressors. Movement parameters and linear and quadratic nuisance regressors were also included in the GLM. We then used the obtained coefficients to compare the BOLD response in different conditions in each ROI.”
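As an illustration of the regressor construction described above, the sketch below builds a small design matrix; the TR, run length, block onsets, and HRF shape are all hypothetical, not our actual acquisition parameters:

```python
import numpy as np

tr, n_vols = 2.0, 120
t = np.arange(n_vols) * tr  # scan times in seconds

def hrf(ts, peak=6.0):
    # simple gamma-shaped HRF approximation, peaking at `peak` seconds
    h = (ts / peak) ** peak * np.exp(peak - ts)
    return h / h.max()

# Hypothetical block onsets (s) for three task conditions; 8-s blocks
onsets = {"B": [10.0, 106.0], "H": [42.0], "BH": [74.0]}
duration = 8.0

task = np.zeros((n_vols, len(onsets)))
kernel = hrf(np.arange(0, 32, tr))
for j, cond_onsets in enumerate(onsets.values()):
    boxcar = np.zeros(n_vols)
    for onset in cond_onsets:
        boxcar[(t >= onset) & (t < onset + duration)] = 1.0
    task[:, j] = np.convolve(boxcar, kernel)[:n_vols]  # convolve boxcar with HRF

# Nuisance regressors: constant, linear, and quadratic trends (motion omitted here)
trends = np.vander(np.linspace(-1, 1, n_vols), 3, increasing=True)
design = np.hstack([task, trends])
```

Solving the least-squares problem for this design matrix yields one coefficient per condition per voxel, which is what the ROI comparisons operate on.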
(8) Lines 79-81: Were the model parameters determined in the second half and predictions made in the first half as well? Would averaging the two directions provide more robust results?
We thank the reviewer for this suggestion. We have now changed the analysis to include results from both halves. The results are very similar. We have changed the figures to report the average of the results for the two halves.
We have added the information in the Results section in lines 207-211:
“To compare the three models in their ability to predict the data, we split the fMRI data into two halves (odd and even runs) and estimated the model parameters separately for each voxel of each participant twice: once using the first half of the data, and a second time using the second half of the data. All comparisons of data with model predictions were made using the left-out half of the data in each case. All model results illustrate the average of these two cross-validated predictions.”
We have also added the information in the Methods section in lines 662-666:
“We repeated this procedure twice: once using the odd half of the data for parameter estimation and the even half for comparing model predictions with the data, and a second time using the even half of the data for parameter estimation and the odd half for comparison with the model predictions. All figures, including model results, illustrate the average of the two repetitions.”
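The two-direction procedure can be sketched as a small helper; the `fit` and `predict` arguments are placeholders for our model-fitting and prediction routines:

```python
import numpy as np

def two_fold_r2(fit, predict, data_odd, data_even):
    # Fit parameters on one half, compute r-squared on the left-out half,
    # then average the two directions
    scores = []
    for train, test in [(data_odd, data_even), (data_even, data_odd)]:
        params = fit(train)
        pred = predict(params, test)
        ss_res = np.sum((test - pred) ** 2)
        ss_tot = np.sum((test - np.mean(test)) ** 2)
        scores.append(1 - ss_res / ss_tot)
    return np.mean(scores)
```

A model that reproduces the left-out data exactly scores 1 in both directions; a model that overfits its training half is penalized because its left-out residuals grow.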
(9) Category stimuli were fit within a 10.2 degree square of visual angle. The methods describe the V1 localizer as a wedge 60 degree in angle. What was the stimulus extent (i.e. eccentricity) of the wedge? Was it also 10.2 degrees so that the V1 ROI encompassed the same amount of visual space as the category stimuli? Same question for the Category localizer. Were these stimuli also 10.2 degrees? Were the ROIs (at least for V1?) defined within the same eccentricity range so that the ROI analysis was confined to where the category stimuli were presented in the visual field?
We thank the reviewer very much for their comment about the size of the stimuli in the main experiment compared to the localizer tasks. The stimuli in the V1 localizer and the category localizer subtended 27.1 and 14.3 degrees of visual angle in diameter, respectively. We have added this information to the Methods section.
Since the localizer stimuli were larger than the stimuli used in the main experiment, we added a step to our analysis in which we selected the voxels in each ROI that were significantly active during stimulus presentation compared to the fixation block (p < 0.01), to compensate for the difference in size between the localizer stimuli and the stimuli presented in the main experiment. This voxel selection led to a slight change in the results, most notably in V1, where the normalization model’s fit improved after voxel selection. The procedure did not qualitatively change the results in the other regions. We have added the details of this extra step to the Methods section.
(10) What software was used to define LOC PF EBA and PPA? And what metric do we have to ensure that these are reliable ROIs? The authors would benefit by including figures that show examples of the ROIs overlaid on contrast data for the localizer conditions. The same goes for the β weights within the ROIs.
We used Freesurfer’s tksurfer module. We defined the ROIs for each participant individually based on the voxels that were significantly active in the contrasts mentioned in the Methods section. The activation maps were thresholded at p < 0.001. We have provided examples of the defined ROIs overlaid on contrast data for one of the participants. Since these are standard localizers, and we selected all voxels that passed the threshold as stated in the paper, we do not think adding every individual participant’s localizer data to the paper would be useful. We have therefore decided not to include them. If the reviewer feels strongly about this, we are happy to add them as a figure supplement to figure 1 (see Author response image 1).
(11) Preprocessing – how were the data aligned?
We have added the information in the Methods section in lines 574-577:
“The data in each run were motion-corrected per-run and aligned to the anatomical data using the middle time point of that run. The fMRI data from the localizer was smoothed using a 5-mm FWHM Gaussian kernel, but no spatial smoothing was performed on the data from the main experiment to optimize the voxel-wise analyses.”
(12) The paper refers to a Supplementary Methods, which I could not find.
Thank you for catching this typo. We have corrected the sentence in line 217.
Results and statistics
(1) What was accuracy like on these tasks? There was no description of performance at all. It is necessary to provide this information to assess the attentional manipulation
We have included the following paragraph in the Results section in lines 111-117:
“Overall, average accuracy was higher than 86% in all conditions. Averaged across participants, accuracy was 94%, 89%, 86%, 93%, 94%, 96%, 95% and 96% for the Bat, BatH, BHat, Hat, B, H, and BH conditions and the fixation block with no stimulus, respectively. A one-way ANOVA across conditions showed a significant effect of condition on accuracy (F(7,126) = 8.24, p = 1.63⨯10⁻⁵) and reaction time (F(7,126) = 22.57, p = 4.52⨯10⁻¹²). As expected, post-hoc t-tests showed that this was due to lower performance in the BatH and BHat conditions (see figure 1—figure supplement 1). There was no significant difference in performance between all other conditions (ps > 0.07, corrected).”
We have also illustrated the accuracies and reaction times of all task conditions in figure 1—figure supplement 1.
(2) Line 96 – What is ts? T-statistic? Did you run a series of t-tests comparing normalization vs noise ceiling? Why not an ANOVA and correct for multiple comparisons? And what does it mean that the R2 was not significant different from noise ceiling? Interpretation of these results must be provided.
We thank the reviewer for pointing out the problem in our analysis. We have now reported the results of ANOVA and post-hoc t-tests with multiple comparison corrections in the Results section in lines 229-234:
“We first compared the goodness of fit of the three models across the five ROIs using a 3⨯5 repeated measures ANOVA. The results showed a significant main effect of model (F(2,36) = 72.9, p = 9.86⨯10⁻¹¹) and ROI (F(4,72) = 26.66, p = 1.04⨯10⁻⁷), and a significant model by ROI interaction (F(8,144) = 24.96, p = 3.74⨯10⁻¹⁵). On closer inspection, the normalization model was a better fit to the data than both the weighted sum (ps < 0.019, corrected) and the weighted average (ps < 5.7⨯10⁻⁵, corrected) models in all ROIs.”
To answer the reviewer’s question about the noise ceiling, we defined the noise ceiling in each region separately as the r-squared of the correlation between the odd and even halves of the data. Given that the correlation between the model and the data cannot exceed the reliability of the data (as calculated by the correlation between the data from odd and even runs), the r-squared can also not exceed the squared split-half reliability. The noise ceiling (squared split-half reliability), therefore, determines the highest possible goodness of fit a model can reach.
We calculated the normalization model’s r-squared difference from the noise ceiling (NRD) for each ROI. NRD is a measure of the ability of the model in accounting for the explainable variation in the data; the lower the difference between the noise ceiling and a model’s goodness of fit, the more successful that model is in predicting the data. We ran a one-way ANOVA to test for the effect of ROI on NRD, and observed that this measure was not significantly different across ROIs (F(4,72)=0.58, p = 0.61), demonstrating that the normalization model was equally successful across ROIs.
We have included the interpretation of the noise ceiling and its relationship with the models’ goodness of fit in the Results section in lines 223-228:
“We also calculated the noise ceiling in each region separately as the r-squared of the correlation between the odd and even halves of the data. Given that the correlation between the model and the data cannot exceed the reliability of the data (as calculated by the correlation between the data from odd and even runs), the r-squared can also not exceed the squared split-half reliability. The noise ceiling (squared split-half reliability), therefore, determines the highest possible goodness of fit a model can reach.”
We have also included the result of the ANOVA test for the difference between the noise ceiling and the normalization model’s r-squared across ROIs in the Results section in lines 240-246:
“We then calculated the normalization model’s r-squared difference from the noise ceiling (NRD) for each ROI. NRD is a measure of the ability of the model in accounting for the explainable variation in the data; the lower the difference between the noise ceiling and a model’s goodness of fit, the more successful that model is in predicting the data. We ran a one-way ANOVA to test for the effect of ROI on NRD, and observed that this measure was not significantly different across ROIs (F(4,72)=0.58, p = 0.61), demonstrating that the normalization model was equally successful across ROIs in predicting the explainable variation in the data.”
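The noise-ceiling and NRD computations described in this response can be sketched in a few lines (a toy illustration with made-up numbers; the function and variable names are ours, not from the analysis code):

```python
import numpy as np

def noise_ceiling(odd, even):
    """Squared split-half correlation between responses estimated from the
    odd and even runs: an upper bound on any model's goodness of fit."""
    r = np.corrcoef(odd, even)[0, 1]
    return r ** 2

def nrd(ceiling_r2, model_r2):
    """Model's r-squared difference from the noise ceiling; lower values
    mean the model accounts for more of the explainable variance."""
    return ceiling_r2 - model_r2

# Toy responses across conditions from the two halves of the data
odd_half = np.array([0.8, 1.2, 1.0, 1.5, 0.7])
even_half = np.array([0.9, 1.1, 1.0, 1.4, 0.8])
ceiling = noise_ceiling(odd_half, even_half)
```

Because the ceiling is a squared correlation, it is bounded by 1, and any model's r-squared in that region is compared against it rather than against 1.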
(3) Lines 119-123. Again, why a t-test statistic here? Did you run multiple t-tests for each ROI? Shouldn't this be an ANOVA across the 5 ROIs? As it stands, line 120 says that you ran multiple t tests across all the regions, and then report single t-statistic? The paper needs to be clear (and correct) in the type of statistics it uses to make its claims.
We have now performed a two-way repeated-measures ANOVA across the five ROIs to compare the three models’ predictions of the effects of attention, and we have corrected for multiple comparisons in all our analyses. The results remain qualitatively similar.
We have included the results in the Results section in lines 282-291:
“To compare how closely the predictions of the three models followed the response change in the data, we calculated the difference between the response change observed in the data and the response change predicted by each model. Then, we ran a 3⨯5 repeated measures ANOVA with within-subject factors of model and ROI on the obtained difference values. The results demonstrated a significant effect of model (F(2,36) = 105.59, p = 4.98⨯10⁻⁹), a significant effect of ROI (F(4,72) = 13.88, p = 4.62⨯10⁻⁶), and a significant model by ROI interaction (F(8,144) = 28.63, p = 2.13⨯10⁻⁸). Post-hoc t-tests showed that the predictions of the normalization model were closer to the response change observed in the data in all ROIs (ps < 5.7⨯10⁻⁷, corrected) except in V1, where the predictions of the weighted sum and the weighted average models were closer to the data (ps < 8.6⨯10⁻⁸, corrected).”
And in lines 310-318:
“We calculated the difference between the asymmetry index observed in the data and the index predicted by each model and performed a 3⨯5 repeated measures ANOVA to compare the three models in how closely they predicted the asymmetry effect across ROIs using these difference values. We observed a significant effect of model (F(2,36) = 185.3, p = 1.51⨯10⁻¹¹), a significant effect of ROI (F(4,72) = 64.97, p = 3.71⨯10⁻¹⁵), and a significant model by ROI interaction (F(8,144) = 45.60, p = 8.97⨯10⁻¹⁷). The prediction of the normalization model was closer to the data in all regions (ps < 1.9⨯10⁻⁷, corrected) except for PPA, where the prediction of the weighted sum model was closer to the asymmetry observed in the data than the prediction of the normalization model (p = 4.37⨯10⁻⁵, corrected).”
(4) There is no assessment of the quality of the data – no Β weight maps showing the responses on the brain, or what proportion of voxels in each ROI actually responded to the stimuli. A lot of interesting data that could be shown is reduced to line plots -this comes across as the paper starting at the end, rather than providing any sense of buildup in the data analysis and the conclusions that the authors draw.
We thank the reviewer for this suggestion. We have added a new figure including the average voxel responses in each condition for each ROI and the percentage of voxels in each region that are more responsive to houses and bodies (panel f in figure 2, also pasted in response to comment 9 of the Theoretical framework and interpretation section of this letter). We think the addition of this figure helped with the clarification of the later analyses and flow of the manuscript.
(5) The authors show the goodness of fit across the 5 modeled conditions in 2C – fine, but what is the justification for lumping all the conditions together? Is there a supplement figure that teases this apart, with the goodness of fit for each model for each of the conditions, for each ROI? It would help to be comprehensive.
We apologize if the method was not clear. We defined the goodness of fit across conditions for each voxel separately (not across voxels within each condition); it is therefore not possible to show the goodness of fit for each condition. We calculated the goodness of fit in this manner to evaluate the models based on their ability to predict response changes with the addition of a second stimulus and with shifts of attention. Since correlation is blind to a systematic prediction error shared by all voxels in a condition, calculating the goodness of fit across voxels could lead to misinterpretation. See figure 3—figure supplement 1 for an example showing how we calculated the goodness of fit for one voxel across all conditions.
The details of r-squared calculation were provided before in the Results and Methods sections. We have now included a reference to this figure in the Results section in lines 220-222:
“We calculated the goodness of fit for each voxel by taking the square of the correlation coefficient between the predicted model response and the respective fMRI responses across the five modeled conditions (figure 3—figure supplement 1).”
and in the Methods section in lines 669-671:
“The goodness of fit was calculated by taking the square of the correlation coefficient between the observed and predicted responses for each voxel across the five modeled conditions (figure 3—figure supplement 1).”
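As a concrete sketch of this per-voxel calculation (toy values, not actual data; names are our own):

```python
import numpy as np

def voxel_goodness_of_fit(observed, predicted):
    """Square of the Pearson correlation between observed and predicted
    responses across the five modeled conditions, for one voxel."""
    r = np.corrcoef(observed, predicted)[0, 1]
    return r ** 2

# One voxel's responses across the five modeled conditions (toy values)
observed = np.array([0.6, 1.1, 0.9, 1.4, 0.8])
predicted = np.array([0.5, 1.0, 1.0, 1.3, 0.7])
r2 = voxel_goodness_of_fit(observed, predicted)
```

Because the correlation is taken across conditions within a voxel, it is sensitive to how well the model tracks response changes across conditions rather than to a shared offset across voxels.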
(6) Averaging all the data together (e.g., in Figure 2c across all conditions) does not provide a clear picture of the results. Is it possible that it is only in the attention conditions that the normalization model outperforms the others (although again, note the remark about the different numbers of free parameters in each model)? It would be more transparent to plot Figure 2c separately for the isolated/paired conditions, even though the story might become more complex.
As we pointed out in our response to the previous comment, the goodness of fit was calculated across the five conditions (as illustrated in figure 3—figure supplement 1). Since there is only one condition with unattended stimuli, it is not possible to calculate the goodness of fit across one condition only. Please note that we have provided the prediction of each condition separately in figure 3a-e. However, we have compared the results of the PN condition with model predictions and have added the statistics of this comparison in the Results section in lines 247-257:
“Interestingly, just focusing on the paired condition in which neither of the stimuli was attended (the PN condition), the results of the weighted average model were closer to the normalization model (the gray and orange isolated data points in subplots a-e of figure 3 are similarly close to the navy data point in some regions). For this condition, the predictions of the normalization model were significantly closer to the data compared to the predictions of the weighted average model in V1, pFs, and PPA (ps < 0.028, corrected) but not significantly closer in LO and EBA (ps > 0.09, corrected). These results are in agreement with previous studies suggesting that the weighted average model provides good predictions of neural and voxel responses in the absence of attention (Zoccolan et al. 2005, Macevoy and Epstein 2009, Kliger and Yovel 2020). However, when considering all the attended and unattended conditions, our results show that the normalization model is a generally better fit across all ROIs.”
Regarding your point about the different number of parameters in the models, as explained in points 1, 2, and 4 of the Model section of this letter, we have used new models with more free parameters, AIC calculation, and simulations to show that the difference in the number of parameters cannot account for the observed success of the normalization model compared to other models.
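For orientation, the general form of the three model classes can be sketched as follows. This is a textbook-style toy parameterization (divisive normalization in the spirit of Reynolds and Heeger, 2009), not the exact equations or fitted parameters used in the paper:

```python
def weighted_sum(r_a, r_b, w_a=1.0, w_b=1.0):
    """Paired response as a weighted sum of the isolated responses."""
    return w_a * r_a + w_b * r_b

def weighted_average(r_a, r_b, w_a=1.0, w_b=1.0):
    """Paired response as a weighted average of the isolated responses."""
    return (w_a * r_a + w_b * r_b) / (w_a + w_b)

def normalization(l_a, l_b, beta_a=1.0, beta_b=1.0, sigma=1.0):
    """Divisive normalization of stimulus drives l_a and l_b, with
    attention modeled as multiplicative gains beta_a and beta_b that
    scale both the numerator and the normalization pool (toy form)."""
    drive = beta_a * l_a + beta_b * l_b
    return drive / (sigma + drive)
```

In this toy form, raising the attentional gain on one stimulus increases its contribution to both the numerator and the normalization pool, which is what lets the normalization model predict attention effects that the sum and average models cannot.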
(7) Related to the above points, the data are provided in a very synthesized manner, and raw results are not shown. For example, could we see the values of the estimated model parameters? How do they compare when determined in each half of the data (to get a sense of reliability)?
We have included the values of the estimated parameters for the even and odd runs for each model in supplementary files 3-8. As shown in the tables, the values of the estimated parameters in the two halves are very close to each other.
We have also added a figure showing the average fMRI regression coefficients in each condition (figure 2), as well as voxel preference consistency across odd and even runs and fMRI regression coefficients plotted separately for odd and even runs (both in figure 2—figure supplement 1), to provide readers with a sense of reliability and data quality.
Figures
(1) Figure 1b: The gray squares overlap a lot so that we don't see the stimuli in each gray square.
We have reduced the overlap between the gray squares, so the stimuli are fully shown.
(2) The y-axis label in Figure 2A is not BOLD response (this would be appropriate if the figure was a BOLD time series); is this the average fMRI Β weight?
Yes, we used the fMRI β weights. However, to avoid confusing the GLM β weights with the attention-related β values in the models, and to address the reviewer’s concern, we opted to use “fMRI regression coefficient” in this plot.
(3) In Figure 2 (a and b), the connecting of data points as if they are on a continuum is not correct given their categorical nature. Moreover, the figure could be improved so that the actual data points are made more prominent, and the model predictions are dashed lines. What exactly do the error bars in Figure 2a and 2b correspond to?
Although responses in the seven conditions are discrete and not continuous, we have connected the responses in the attended conditions (in which body or house stimuli were attended) and the unattended conditions (in which body and house were ignored and the fixation point color was attended) separately. This was done, first, to avoid the confusion caused by illustrating too many single data points in the figure and, second, to better visualize the effects of attention. As shown in Author response image 2, removing the lines would render the plot incomprehensible.
We have added the justification for connecting data points in the Results section in lines 155-159:
“Note that although the seven conditions constitute a discrete and not a continuous variable, we have connected the responses in attended conditions (in which body or house stimuli were attended) and unattended conditions (in which body and house were ignored and the fixation point color was attended) separately. This was done for visual purposes and ease of understanding.”
Regarding your suggestion on making the data points more prominent, we have thickened the lines related to the actual data.
Error bars represent standard errors of the mean for each condition, calculated across participants after removing the overall between-subject variance. We have added the information in the figure caption.
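A minimal sketch of such within-subject error bars (Cousineau-style normalization; the array layout and names are illustrative assumptions, not the paper's code):

```python
import numpy as np

def within_subject_sem(data):
    """SEM per condition after removing overall between-subject variance:
    subtract each subject's mean, add back the grand mean, then take the
    SEM across subjects. `data` has shape (subjects, conditions)."""
    centered = data - data.mean(axis=1, keepdims=True) + data.mean()
    return centered.std(axis=0, ddof=1) / np.sqrt(data.shape[0])
```

Removing each subject's overall offset before computing the SEM keeps the error bars from being inflated by between-subject differences that are irrelevant to the within-subject condition comparisons.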
(4) Suggestion to order the panels in Figure 2 logically (the BOLD and model data all presented in A for all 5 ROIs).
We have now ordered the panels logically according to the reviewer’s suggestion.
(5) Is there any statistic to help evaluate the data in Figure 2C? (Line 93)
We have now added the statistics in the Results section in lines 229-234:
“We first compared the goodness of fit of the three models across the five ROIs using a 3⨯5 repeated measures ANOVA. The results showed a significant main effect of model (F(2,36) = 72.9, p = 9.86⨯10⁻¹¹) and ROI (F(4,72) = 26.66, p = 1.04⨯10⁻⁷), and a significant model by ROI interaction (F(8,144) = 24.96, p = 3.74⨯10⁻¹⁵). On closer inspection, the normalization model was a better fit to the data than both the weighted sum (ps < 0.019, corrected) and the weighted average (ps < 5.7⨯10⁻⁵, corrected) models in all ROIs.”
(6) FIGURE 3 – Why not include the BOLD response for P only and N only? That would be informative.
The BOLD responses for the isolated P and isolated N conditions are shown in figure 2a-e (now figure 3a-e, since we have added a new figure).
The purpose of figure 3 (now figure 4) is to explain the two features of attention, which relate to the conditions with attention directed to one of the stimuli. We agree that including the P and N conditions along with the attended conditions would be informative, and this is already provided in figure 2 (now figure 3) for all regions; figure 4 provides only an example to better visualize how each index is calculated.
https://doi.org/10.7554/eLife.75726.sa2

Article and author information
Author details
Funding
National Institutes of Health (ZIA-MH002035)
- Maryam Vaziri-Pashkam
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
Maryam Vaziri-Pashkam was supported by NIH Intramural Research Program ZIA-MH002035.
Ethics
All participants gave written consent prior to their participation in the experiment. Imaging was performed according to safety guidelines approved by the ethics committee of the Institute for Research in Fundamental Sciences with the reference number 98/60.1/2184.
Senior Editor
- Tirin Moore, Howard Hughes Medical Institute, Stanford University, United States
Reviewing Editor
- Marisa Carrasco, New York University, United States
Version history
- Preprint posted: May 23, 2021 (view preprint)
- Received: November 20, 2021
- Accepted: April 25, 2023
- Accepted Manuscript published: April 26, 2023 (version 1)
- Accepted Manuscript updated: May 9, 2023 (version 2)
- Version of Record published: May 30, 2023 (version 3)
Copyright
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.