1. Neuroscience
Download icon

A neural mechanism for contextualizing fragmented inputs during naturalistic vision

  1. Daniel Kaiser  Is a corresponding author
  2. Jacopo Turini
  3. Radoslaw M Cichy
  1. University of York, United Kingdom
  2. Freie Universität Berlin, Germany
  3. Goethe-Universität Frankfurt, Germany
  4. Humboldt-Universität Berlin, Germany
  5. Bernstein Center for Computational Neuroscience Berlin, Germany
Short Report
  • Cited 1
  • Views 796
  • Annotations
Cite this article as: eLife 2019;8:e48182 doi: 10.7554/eLife.48182

Abstract

With every glimpse of our eyes, we sample only a small and incomplete fragment of the visual world, which needs to be contextualized and integrated into a coherent scene representation. Here we show that the visual system achieves this contextualization by exploiting spatial schemata, that is our knowledge about the composition of natural scenes. We measured fMRI and EEG responses to incomplete scene fragments and used representational similarity analysis to reconstruct their cortical representations in space and time. We observed a sorting of representations according to the fragments' place within the scene schema, which occurred during perceptual analysis in the occipital place area and within the first 200 ms of vision. This schema-based coding operates flexibly across visual features (as measured by a deep neural network model) and different types of environments (indoor and outdoor scenes). This flexibility highlights the mechanism's ability to efficiently organize incoming information under dynamic real-world conditions.

https://doi.org/10.7554/eLife.48182.001

Introduction

During natural vision, the brain continuously receives incomplete fragments of information that need to be integrated into meaningful scene representations. Here, we propose that this integration is achieved through contextualization: the brain uses prior knowledge about where information typically appears in a scene to meaningfully sort incoming information.

A format in which such prior knowledge about the world is represented in the brain is provided by schemata. First introduced to philosophy to explain how prior knowledge enables perception of the world (Kant, 1781), schemata were later adapted by psychology (Barlett, 1932; Piaget, 1926) and computer science (Minsky, 1975Rumelhart, 1980) as a means to formalize mechanisms enabling natural and artificial intelligence, respectively.

In the narrower context of natural vision, scene schemata represent knowledge about the typical composition of real-world environments (Mandler, 1984). Scene schemata for example entail knowledge about the distribution of objects across scenes, where objects appear in particular locations across the scene and in particular locations with respect to other objects (Kaiser et al., 2019a; Torralba et al., 2006; Võ et al., 2019; Wolfe et al., 2011).

The beneficial role of such scene schemata was first investigated in empirical studies of human memory, where memory performance is boosted when scenes are configured in accordance with the schema (Brewer and Treyens, 1981; Mandler and Johnson, 1976Mandler and Parker, 1976).

Recently however, it has become clear that scene schemata not only organize memory contents, but also the contents of perception. For example, knowledge about the structure of the world can be used to generate predictions about a scene’s content (Bar, 2009; Henderson, 2017), or to efficiently organize the concurrent representation of multiple scene elements (Kaiser et al., 2014; Kaiser et al., 2019b). This position is reinforced by behavioral studies demonstrating a beneficial role of schema-congruent naturalistic stimuli across a variety of perceptual tasks, such as visual detection (Biederman et al., 1982; Davenport and Potter, 2004; Stein et al., 2015) and visual search (Kaiser et al., 2014; Torralba et al., 2006; Võ et al., 2019).

Here, we put forward a novel function of scene schemata in visual processing: they support the contextualization of fragmented sensory inputs. If sensory inputs are indeed processed in relation to the schema context, scene fragments stemming from similar typical positions within the scene should be processed similarly and fragments stemming from different positions should be processed differently. Therefore, the neural representations of scene fragments should be sorted according to their typical place within the scene.

We tested two hypotheses about this sorting process. First, we hypothesized that this sorting occurs during perceptual scene analysis, which can be spatiotemporally pinpointed to scene-selective cortex (Baldassano et al., 2016; Epstein, 2014) and the first 250 ms of processing (Cichy et al., 2017; Harel et al., 2016). Second, given that schema-related effects in behavioral studies (Mandler and Parker, 1976) are more robustly observed along the vertical dimension, where the scene structure is more rigid (i.e., the sky is almost always above the ground), we hypothesized that the cortical sorting of information should primarily occur along the vertical dimension.

To test these hypotheses, we used a novel visual paradigm in which participants were exposed to fragmented visual inputs, and recorded fMRI and EEG data to resolve brain activity in space and time.

Results

In our study, we experimentally mimicked the fragmented nature of naturalistic visual inputs by dissecting scene images into position-specific fragments. Six natural scene images (Figure 1a) were each split into six equally-sized fragments (three vertical × 2 horizontal), resulting in 36 conditions (six scenes × 6 fragments). In separate fMRI (n = 30) and EEG (n = 20) experiments, participants viewed these fragments at central fixation while performing an indoor/outdoor categorization task to ensure engagement with the stimulus (Figure 1b). Critically, this design allowed us to investigate whether the brain sorts the fragments with respect to their place in the schema in the absence of explicit location differences (Figure 1c).

Figure 1 with 1 supplement see all
Experimental design and rationale of schema-based information sorting.

(a) The stimulus set consisted of six natural scenes (three indoor, three outdoor). Each scene was split into six rectangular fragments. (b) During the fMRI and EEG recordings, participants performed an indoor/outdoor categorization task on individual fragments. Notably, all fragments were presented at central fixation, removing explicit location information. (c) We hypothesized that the visual system sorts sensory input by spatial schemata, resulting in a cortical organization that is explained by the fragments’ within-scene location, predominantly in the vertical dimension: Fragments stemming from the same part of the scene should be represented similarly. Here we illustrate the hypothesized sorting in a two-dimensional space. A similar organization was observed in multi-dimensional scaling solutions for the fragments’ neural similarities (see Figure 1—figure supplement 1 and Video 1). In subsequent analyses, the spatiotemporal emergence of the schema-based cortical organization was precisely quantified using representational similarity analysis (Figure 2).

https://doi.org/10.7554/eLife.48182.002
Video 1
Time-resolved MDS visualization of the neural RDMs.

To directly visualize the emergence of schematic coding from the neural data, we performed a multi-dimensional scaling (MDS) analysis, where the time-resolved neural RDMs (averaged across participants) were projected onto a two-dimensional space. The RDM time series was smoothed using a sliding averaging window (15 ms width). Computing MDS solutions across time yielded a movie (5 ms resolution), where fragments travel through an arbitrary space, eventually forming a meaningful organization. Notably, around 200 ms, a division into the three vertical locations can be observed.

https://doi.org/10.7554/eLife.48182.004

To quantify the sorting of fragments during cortical processing we used spatiotemporally resolved representational similarity analysis (Cichy et al., 2014; Kriegeskorte et al., 2008). We first extracted representational dissimilarity matrices (RDMs) from the fMRI and EEG data, which indexed pairwise dissimilarities of the fragments’ neural representations (for details on RDM construction see Figure 2—figure supplement 1). In the fMRI (Figure 2a), we extracted spatially-resolved neural RDMs from scene-selective occipital place area (OPA) and parahippocampal place area (PPA), and from early visual cortex (V1) (for temporal response profiles in these regions see Figure 2—figure supplement 2). In the EEG (Figure 2b), we extracted time-resolved neural RDMs from −200 ms to 800 ms relative to stimulus onset from posterior EEG electrodes (for other electrode groups see Figure 2—figure supplements 35).

Figure 2 with 9 supplements see all
Spatial schemata determine cortical representations of fragmented scenes.

(a) To test where and when the visual system sorts incoming sensory information by spatial schemata, we first extracted spatially (fMRI) and temporally (EEG) resolved neural representational dissimilarity matrices (RDMs). In the fMRI, we extracted pairwise neural dissimilarities of the fragments from response patterns across voxels in the occipital place area (OPA), parahippocampal place area (PPA), and early visual cortex (V1). (b) In the EEG, we extracted pairwise dissimilarities from response patterns across electrodes at every time point from −200 ms to 800 ms with respect to stimulus onset. (c) We modelled the neural RDMs with three predictor matrices, which reflected their vertical and horizontal positions within the full scene, and their category (i.e., their scene or origin). (d) The fMRI data revealed a vertical-location organization in OPA, but not V1 and PPA. Additionally, the fragment’s category predicted responses in both scene-selective regions. (e) The EEG data showed that both vertical location and category predicted cortical responses rapidly, starting from around 100 ms. These results suggest that the fragments’ vertical position within the scene schema determines rapidly emerging representations in scene-selective occipital cortex. Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean. In further analysis, we probed the flexibility of this schematic coding mechanism (Figure 3).

https://doi.org/10.7554/eLife.48182.005

We then quantified schema effects using separate model RDMs for horizontal and vertical locations (Figure 2c). These location RDMs reflected whether pairs of fragments shared the same location or not. We additionally constructed a category model RDM, which reflected whether pairs of fragments stemmed from the same scene or not.

Critically, if cortical information is indeed sorted with respect to scene schemata, we should observe a neural clustering of fragments that stem from the same within-scene location – in this case, the location RDM should predict a significant proportion of the representational organization in visual cortex.

To test this, we modeled neural RDMs as a function of the model RDMs using general linear models, separately for the fMRI and EEG data. The resulting beta weights indicated to which degree location and category information accounted for cortical responses in the three ROIs and across time.

The key observation was that the fragments’ vertical location predicted neural representations in OPA (t[29] = 4.12, p<0.001, pcorr <0.05), but not in V1 and PPA (test statistics for all analyses and ROIs are reported in Supplementary file 1) (Figure 2d) and between 55 ms and 685 ms (peak: t[19] = 9.03, p<0.001, pcorr <0.05) (Figure 2e). This vertical-location organization was consistent across the first and second half of the experiments (see Figure 2—figure supplement 6) and across all pairwise comparisons along the vertical axis (see Figure 2—figure supplement 7). No effects were observed for horizontal location, consistent with more rigid spatial scene structure in the vertical dimension (Mandler and Parker, 1976). This result provides a first characterization of where and when incoming information is organized in accordance with scene schemata: in OPA and rapidly after stimulus onset, scene fragments are sorted according to their origin within the environment.

The schema-based organization co-exists with a prominent scene-category organization: In line with previous findings (Lowe et al., 2018; Walther et al., 2009), category was accurately predicted in OPA (t[29] = 3.12, p=0.002, pcorr <0.05) and PPA (t[29] = 4.26, p<0.001, pcorr <0.05) (Figure 2d), and from 60 ms to 775 ms (peak: t[19] = 6.39, p<0.001, pcorr <0.05) (Figure 2e).

To efficiently support vision in dynamic natural environments, schematic coding needs to be flexible with respect to visual properties of specific scenes. The absence of vertical location effects in V1 indeed highlights that schematic coding is not tied to the analysis of simple visual features. To more thoroughly probe this flexibility, we additionally conducted three complementary analyses (Figure 3).

Figure 3 with 3 supplements see all
Schematic coding operates flexibly across visual and conceptual scene properties.

(a) To determine the role of categorization-related visual features in this schematic organization, we regressed out RDMs obtained from 18 layers along the ResNet50 DNN before repeated the three-predictor general linear model (GLM) analysis (Figure 2c). (b/c) Removing DNN features abolished category information in fMRI and EEG signals, but not vertical location information. (d) To test for generalization across different scene types, we restricted location predictor RDMs to comparisons across indoor and outdoor scenes. Due to this restriction, category could not be modelled. (e/f) In this analysis, vertical location still predicted neural organization in OPA and from 70 ms. (g) Finally, we combined the two analyses: we first regressed out DNN features prior and then modelled the neural RDMs using the restricted predictor RDMs (d). (h) In this analysis, we still found significant vertical location information in OPA. (i) Notably, vertical location information in the EEG signals was delayed to after 180 ms, suggesting that at this stage schematic coding becomes flexible to visual and conceptual attributes. Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean.

https://doi.org/10.7554/eLife.48182.015

First, we tested whether schematic coding is tolerant to stimulus features relevant for visual categorization. Categorization-related features were quantified using a deep neural network (DNN; ResNet50), which extracts such features similarly to the brain (Wen et al., 2018). We removed DNN features by regressing out layer-specific RDMs constructed from DNN activations (see Materials and Methods for details) (Figure 3a); subsequently, we re-estimated location and category information.

After removing DNN features, category information was rendered non-significant in both fMRI and EEG signals. When directly comparing category information before and after removing the DNN features, we found reduced category information in PPA (t[29] = 2.48, p = 0.010, pcorr <0.05) and OPA (t[29] = 1.86, p = 0.036, pcorr >0.05), and a strong reduction of category information across time, from 75 ms to 775 ms (peak t[19] = 13.0, p<0.001, pcorr <0.05). Together, this demonstrates that categorization-related brain activations are successfully explained by DNN features (Cichy et al., 2016; Cichy et al., 2017; Groen et al., 2018; Güçlü and van Gerven, 2015; Wen et al., 2018), indicating the appropriateness of our DNN for modelling visual brain activations. Despite the suitability of our DNN model for modelling categorical brain responses, vertical location still accounted for the neural organization in OPA (t[29] = 2.37, p = 0.012, pcorr <0.05) (Figure 3b) and between 75 ms and 335 ms (peak: t[19] = 5.06, p<0.001, pcorr <0.05) (Figure 3c). Similar results were obtained using a shallower feed-forward DNN (see Figure 3—figure supplement 1). This result suggests that schematic coding cannot be explained by categorization-related features extracted by DNN models.

DNN features are a useful control for flexibility regarding visual features, because they cover both low-level and high-level features, explaining variance across fMRI regions and across EEG processing time (see Figure 3—figure supplement 2; see also Cichy et al., 2016; Güçlü and van Gerven, 2015). However, to more specifically control for low-level features, we used two commonly employed low-level control models: pixel dissimilarity and GIST descriptors (Oliva and Torralba, 2001). These models neither explained the vertical location organization nor the category organization in the neural data (see Figure 3—figure supplement 3). Finally, as an even stronger control of the low-level features encoded in V1, we used the neural dissimilarity structure in V1 (i.e., the neural RDMs) as a control model, establishing an empirical neural measure of low-level features. With V1 housing precise low-level feature representations, this measure should very well capture the features extracted during the early processing of simple visual features. However, removing the V1 dissimilarity structure did neither abolish the schematic coding effects in the OPA nor in the EEG data (see Figure 3—figure supplement 3). This shows that even if we had control models that approximated V1 representations extremely well – as well as the V1 representations approximate themselves – these models could not explain vertical location effects in downstream processing. Together, these results provide converging evidence that low-level feature processing cannot explain the schematic coding effects reported here.

Second, we asked whether schematic coding operates flexibly across visually diverse situations. To test this explicitly we restricted RDMs to comparisons between indoor and outdoor scenes, which vary substantially in visual characteristics (Torralba and Oliva, 2003) (Figure 3d).

Vertical location still predicted cortical organization in OPA (t[29] = 3.05, p = 0.002, pcorr <0.05) (Figure 3e) and from 70 ms to 385 ms (peak: t[19] = 7.47, p<0.001, pcorr <0.05) (Figure 3f). The generalization across indoor and outdoor scenes indicates that schematic coding operates similarly across radically different scenes, suggesting that the mechanism can similarly contextualize information across different real-life situations.

Finally, for a particularly strong test of flexibility, we tested for schematic coding after removing both DNN features and within-category comparisons (Figure 3g). In this analysis, OPA representations were still explained by the fragments’ vertical location (t[29] = 2.38, p = 0.012, pcorr <0.05) (Figure 3h). Notably, early schema effects were rendered non-significant, while vertical location still predicted representations after 180 ms (peak: t[19] = 4.41, p<0.001, pcorr <0.05) (Figure 3i), suggesting a high degree of flexibility emerging at that time. Interestingly, across all analyses, vertical location information was exclusively found in OPA and always peaked shortly after 200 ms (see Supplementary file 2), suggesting that schematic coding occurs during early perceptual analysis of scenes.

Discussion

Together, our findings characterize a novel neural mechanism for contextualizing fragmented inputs during naturalistic vision. The mechanism exploits schemata to sort sensory inputs into meaningful representations of the environment. This sorting occurs during perceptual scene analysis in scene-selective OPA and within the first 200 ms of vision, and operates flexibly across changes in visual properties.

That schema-based coding can be localized to OPA is consistent with the region’s important role in visual scene processing. Transcranial magnetic stimulation studies suggest that OPA activation is crucial for various scene perception tasks, such as scene discrimination (Dilks et al., 2013; Ganaden et al., 2013), navigating through scenes (Julian et al., 2016) and anticipating upcoming scene information (Gandolfo and Downing, 2019). Functional MRI work suggest that computations in the OPA include the analysis of spatial scene layout (Dillon et al., 2018; Henriksson et al., 2019Lowe et al., 2017) and the parsing of local scene elements like objects and local surfaces (Kamps et al., 2016). Future studies are needed to clarify which of these computations mediate the schema-based coding described here.

As the current study is limited to a small set of scenes, more research is needed to explore whether schema-based coding generalizes to more diverse contents. It is conceivable that schema-based coding constitutes a more general coding strategy that may generalize to other visual contents (such as faces; Henriksson et al., 2015) and non-visual processing domains: when sensory information is fragmented and spatial information is unreliable, the brain may use schematic information to contextualize sensory inputs. This view is in line with Bayesian theories of perception where the importance of prior information for perceptual inference grows with the noisiness and ambiguity of the sensory information at hand (Ernst and Banks, 2002; Kersten et al., 2004).

The schema-based sorting of scene representations provides a mechanism for efficient communication between perceptual and cognitive systems: when scene information is formatted with respect to its role in the environment, it can be efficiently read out by downstream processes. This idea is consistent with the emerging view that cortical representations depend on functional interactions with the environment (Bonner and Epstein, 2017; Groen et al., 2018; Malcolm et al., 2016; Peelen and Downing, 2017). Under this view, formatting perceptual information according to real-world structure may allow cognitive and motor systems to efficiently read out visual information that is needed for different real-world tasks (e.g., immediate action versus future navigation). As the schema-based sorting of scene information happens already during early scene analysis, many high-level processes have access to this information.

Lastly, our results have implications for computational modelling of vision. While DNNs trained on categorization accurately capture the representational divide into different scene categories, they cannot explain the schema-based organization observed in the human visual system. Although this does not mean that visual features extracted by DNN models in principle are incapable of explaining schema-based brain representations, our results highlight that current DNN models of categorization do not use real-world structure in similar ways as the human brain. In the future, augmenting DNN training procedures with schematic information (Katti et al., 2019) may improve their performance on real-world tasks and narrow the gap between artificial and biological neural networks.

To conclude, our findings provide the first spatiotemporal characterization of a neural mechanism for contextualizing fragmented visual inputs. By rapidly organizing visual information according to its typical role in the world, this mechanism may contribute to the optimal use of perceptual information for guiding efficient real-world behaviors, even when sensory inputs are incomplete or dynamically changing.

Materials and methods

Key resources table
Reagent type
(species) or resource
DesignationSource or referenceIdentifiersAdditional
information
Software, algorithmCoSMoMVPAOosterhof et al., 2016RRID:SCR_014519For data analysis
Software, algorithmfieldtripOostenveld et al., 2011RRID:SCR_004849For EEG data preprocessing
Software, algorithmMATLABMathworks Inc.RRID:SCR_001622For stimulus delivery and data analysis
Software, algorithmPsychtoolbox 3Brainard, 1997RRID:SCR_002881For stimulus delivery
Software, algorithmSPM12www.fil.ion.ucl.ac.uk/spm/software/spm12/RRID:SCR_007037For fMRI data preprocessing

Participants

Thirty adults (mean age 23.9 years, SD = 4.4; 26 females) completed the fMRI experiment and twenty (mean age 24.0 years, SD = 4.3; 15 females) completed the EEG experiment. All participants had normal or corrected-to-normal vision. They all provided informed consent and received monetary reimbursement or course credits for their participation. All procedures were approved by the ethical committee of the Department of Education and Psychology at Freie Universität Berlin (reference 140/2017) and were in accordance with the Declaration of Helsinki.

Stimuli

Request a detailed protocol

The stimulus set (Figure 1a) consisted of fragments taken from three images of indoor scenes (bakery, classroom, kitchen) and three images of outdoor scenes (alley, house, farm). Each image was split horizontally into two halves, and each of the halves was further split vertically in three parts, so that for each scene six fragments were obtained. Participants were not shown the full scene images prior to the experiment.

Experimental design

Request a detailed protocol

The fMRI and EEG designs were identical, unless otherwise noted. Stimulus presentation was controlled using the Psychtoolbox (Brainard, 1997; RRID:SCR_002881). In each trial, one of the 36 fragments was presented at central fixation (7° horizontal visual angle) for 200 ms (Figure 1b). Participants were instructed to maintain central fixation and categorize each stimulus as an indoor or outdoor scene image by pressing one of two buttons.

In the fMRI experiment, the inter-trial interval was kept constant at 2,300 ms, irrespective of the participant’s response time. In the EEG experiment, after each response a green or red fixation dot was presented for 300 ms to indicate response correctness; participants were instructed to only blink after the feedback had occurred. Trials were separated by a fixation interval randomly varying between 1500 ms and 2000 ms.

In the fMRI, participants performed six identical runs. Within each run, each of the 36 scene fragments was shown four times, resulting in 144 trials. Additionally, each run contained 29 fixation trials, where only the central fixation dot was shown. Runs started and ended with brief fixation periods; the total run duration was 7:30 min. In the EEG, each of the 36 fragments was presented 40 times during the experiment, for a total of 1440 trials, divided into 10 runs. Three participants performed a shorter version of the experiment, with only 20 repetitions of each image (720 trials in total).

In both experiments, participants performed very well in the indoor/outdoor categorization task (fMRI: 94% correct, 658 ms mean response time, EEG: 96%, 606 ms). Differences in task difficulty across fragments were not related to the neural effects of interest (Figure 2—figure supplement 8).

fMRI recording and preprocessing

Request a detailed protocol

MRI data was acquired using a 3T Siemens Tim Trio Scanner equipped with a 12-channel head coil. T2*-weighted gradient-echo echo-planar images were collected as functional volumes (TR = 2 s, TE = 30 ms, 70° flip angle, 3 mm3 voxel size, 37 slices, 20% gap, 192 mm FOV, 64 × 64 matrix size, interleaved acquisition). Additionally, a T1-weighted image (MPRAGE; 1 mm3 voxel size) was obtained as a high-resolution anatomical reference. During preprocessing, the functional volumes were realigned and coregistered to the T1 image, using MATLAB (RRID:SCR_014519) and SPM12 (www.fil.ion.ucl.ac.uk/spm/; RRID:SCR_014519).

fMRI region of interest definition

Request a detailed protocol

We restricted our analyses to three regions of interest (ROIs). We defined scene-selective occipital place area (OPA; Dilks et al., 2013) and parahippocampal place area (PPA; Epstein and Kanwisher, 1998) using a functional group atlas (Julian et al., 2012). As a control region, we defined early visual cortex (V1) using a probabilistic atlas (Wang et al., 2015). All ROIs were defined in standard space and then inverse-normalized into individual-participant space. For each ROI, we concatenated the left- and right-hemispheric masks and performed analyses on the joint ROI.

EEG recording and preprocessing

Request a detailed protocol

The EEG was recorded using an EASYCAP 64-channel system and a Brainvision actiCHamp amplifier. The electrodes were arranged in accordance with the standard 10–10 system. The data was recorded at a sampling rate of 1000 Hz and filtered online between 0.03 Hz and 100 Hz. All electrodes were referenced online to the Fz electrode. Offline preprocessing was performed in MATLAB, using the FieldTrip toolbox (Oostenveld et al., 2011; RRID:SCR_004849). The continuous EEG data were epoched into trials ranging from 200 ms before stimulus onset to 800 ms after stimulus onset, and baseline corrected by subtracting the mean of the pre-stimulus interval for each trial and channel separately. Trials containing movement-related artefacts were automatically identified and removed using the default automatic rejection procedure implemented in Fieldtrip. Channels containing excessive noise were removed based on visual inspection. Blinks and eye movement artifacts were identified and removed using independent components analysis and visual inspection of the resulting components. The epoched data were down-sampled to 200 Hz.

Representational similarity analysis

To model the representational structure of the neural activity related to our stimulus set, we used representational similarity analysis (RSA; Kriegeskorte et al., 2008). We first extracted neural RDMs separately for the fMRI and EEG experiments, and then used the same analyses to model their organization. To retrieve the fragments’ position within the original scene, as well their scene category, we used a regression approach, where we modeled neural dissimilarity as a linear combination of multiple predictors (Proklova et al., 2016; Proklova et al., 2019).

Constructing neural dissimilarity – fMRI

Request a detailed protocol

For the fMRI data, we used cross-validated correlations as a measure of pairwise neural dissimilarity. First, patterns for each ROI were extracted from the functional images corresponding to the trials of interest. After shifting the activation time course by 3 TRs (i.e., 6 s, accounting for the hemodynamic delay), we extracted voxel-wise activation values for each trial, from the TR that was closest to the stimulus onset on this trial (for results across 6 TRs with respect to trial onset, see Figure 2—figure supplement 2). To account for activation differences between runs, the mean activation across conditions was subtracted from each voxel’s values, separately for each run. For each ROI, response patterns across voxels were used to perform multivariate analyses using the CoSMoMVPA toolbox (Oosterhof et al., 2016; RRID:SCR_014519). For each TR separately, we performed correlation-based (Haxby et al., 2001) multi-voxel pattern analyses (MVPA) for each pair of fragments. These analyses were cross-validated by repeatedly splitting the data into two equally-sized sets (i.e., half of the runs per set). For this analysis, we correlated the patterns across the two sets, both within-condition (i.e., the patterns stemming from the two same fragments and from different sets) and between-conditions (i.e., the patterns stemming from the two different fragments and from different sets). These correlations were Fisher-transformed. Then, we subtracted the within- and between-correlations to obtain a cross-validated correlation measure, where above-zero values reflect successful discrimination. This procedure was repeated for all possible splits of the six runs. Performing this MVPA for all pairs of fragments yielded a 36 × 36 representational dissimilarity matrix (RDM) for each ROI. RDMs’ entries reflected the neural dissimilarity between pairs of fragments (the diagonal remained empty).

Constructing neural dissimilarity – EEG

Request a detailed protocol

For the EEG data, we used cross-validated classification accuracies as a measure of pairwise neural dissimilarity. We thus constructed RDMs across time by performing time-resolved multivariate decoding analyses (Contini et al., 2017). RDMs were built by computing pair-wise decoding accuracy for all possible combinations of the 36 stimuli, using the CoSMoMVPA toolbox (Oosterhof et al., 2016). As we expected the highest classification in sensors over visual cortex (Battistoni et al., 2018; Kaiser et al., 2016), only 17 occipital and posterior sensors (O1, O2, Oz, PO3, PO4, PO7, PO8, POz, P1, P2, P3, P4, P5, P6, P7, P8, Pz) were used in this analysis. We report results for other electrode groups in Figure 2—figure supplements 35. For each participant, classification was performed separately for each time point across the epoch (i.e., with 5 ms resolution). The analysis was performed in a pair-wise fashion: Linear discriminant analysis classifiers were always trained and tested on data from two conditions (e.g., the middle left part of the alley versus the top right part of the farm), using a leave-one-trial-out partitioning scheme. The training set consisted of all but one trials for each of the two conditions, while one trial for each of the two conditions was held back and used for classifier testing. This procedure was repeated until every trial was left out once. Classifier performance was averaged across these repetitions. The pairwise decoding analysis resulted in a 36-by-36 neural RDM for each time point. A schematic description of the RDM construction can be found in Figure 2—figure supplement 1.

Location and category predictors

Request a detailed protocol

We predicted the neural RDMs in a general linear model (GLM; see below) with three different predictor RDMs (36 × 36 entries each) (Figure 2c): In the vertical location RDM, each pair of conditions is assigned either a value of 0, if the fragments stem from the same vertical location, or the value 1, if they stem from different vertical locations (for results with an alternative predictor RDM using Euclidean distances see Figure 2—figure supplement 9). In the horizontal location RDM, each pair of conditions is assigned either a value of 0, if the fragments stem from the same horizontal location, or a value of 1, if they stem from different horizontal locations. In the category RDM, each pair of conditions is assigned either a value of 0, if the fragments stem from the same scene, or a value of 1, if they stem from different scenes.

In an additional analysis, we sought to eliminate properties specific to either the indoor or outdoor scenes, respectively. We therefore constructed RDMs for horizontal and vertical location information which only contained comparisons between the indoor and outdoor scenes. These RDMs were constructed in the same way as explained above, but all comparisons within the same scene type of scene were removed (Figure 3d).

Modelling neural dissimilarity

Request a detailed protocol

To reveal correspondences between the neural data and the predictor matrices, we used GLM analyses. Separately for each ROI (fMRI) or time point (EEG), we modelled the neural RDM as a linear function of the vertical location RDM, the horizontal location RDM, and the category RDM. Prior to each regression, the neural RDMs and predictor RDMs were vectorized by selecting all lower off-diagonal elements – the rest of the entries, including the diagonal, was discarded. Values for the neural RDMs were z-scored. Separately for each subject and each time point, three beta coefficients (i.e., regression weights) were estimated. By averaging across participants, we obtained time-resolved beta estimates for each predictor, showing how well each predictor explains the neural data over time.

Furthermore, we performed an additional GLM analysis with a vertical location predictor and a horizontal location predictor, where comparisons within indoor- and outdoor-scenes were removed (Figure 3d–f); these comparisons were also removed from the regression criterion. Using the same procedure as in the previous GLM analysis, we then estimated the beta coefficients for each predictor at each time point, separately for each subject. For this analysis, a category RDM could not be constructed, as all comparisons of fragments from the same scene were eliminated.

Controlling for deep neural network features

Request a detailed protocol

To control for similarity in categorization-related visual features, we used a deep neural network (DNN) model. DNNs have recently become the state-of-the-art model of visual categorization, as they tightly mirror the neural organization of object and scene representations (Cichy et al., 2016; Cichy et al., 2017; Cichy and Kaiser, 2019; Groen et al., 2018; Güçlü and van Gerven, 2015; Wen et al., 2018). DNNs are similar to the brain as they are trained using excessive training material while dynamically adjusting the ‘tuning’ of their connections. Here, we used a DNN (see below) that has been trained to categorize objects across a large number of images and categories, therefore providing us with a high-quality model of how visual features are extracted for efficient categorization. By comparing DNNs activations and brain responses to the scene fragments, we could quantify to which extent features routinely extracted for categorization purposes account for schema-based coding in the human visual system.

In a two-step approach, we re-performed our regression analysis after removing the representational organization emerging from the DNN. First, we used a regression model to remove the contribution of the dissimilarity structure in the DNN model. This model included one predictor for each layer extracted from the DNN (i.e., one RDM for each processing step along the DNN). Estimating this model allowed us to remove the neural organization explained by the DNN while retaining what remains unexplained (in the regression residuals). Second, we re-ran the previous regression analyses (see above), but now the residuals of the DNN regression were used as the regression criterion, so that only the organization that remained unexplained by the DNN was modeled.

As a DNN model, we used a pre-trained version (trained on image categorization for the ImageNet challenge) of the ResNet50 model (He et al., 2016), as implemented in MatConvNet (Vedaldi and Lenc, 2015). This model’s deeper, residual architecture outperforms shallower models in approximating visual cortex organization (Wen et al., 2018). ResNet50 consists of 16 blocks of residual layer modules, where information both passes through an aggregate of layers within the block, and bypasses the block; then the residual between the processed and the bypassing information is computed. Additionally, ResNet50 has one convolutional input layer, and one fully-connected output layer. Here, to not inflate the number of intercorrelated predictor variables, we only used the final layer of each residual block, and thus 18 layers in total (16 from the residual blocks, and the input and output layers). For each layer, an RDM was built using 1-correlation between the activations of all nodes in the layer, separately for each pair of conditions. For regressing out the DNN RDMs, we added one predictor for each available RDM. In Figure 3—figure supplement 1, we show that an analysis using the AlexNet architecture (Krizhevsky et al., 2012) yields comparable results; in Figure 3—figure supplement 2, we additionally provide information about the DNN model fit across regions and time points.

Statistical testing

Request a detailed protocol

For the fMRI data, we tested the regression coefficients against zero, using one-tailed, one-sample t-tests (i.e., testing the hypothesis that coefficients were greater than zero). Multiple-comparison correction was based on Bonferroni-corrections across ROIs. A complete report of all tests performed on the fMRI data can be found in Supplementary file 1. For the EEG data, we used a threshold-free cluster enhancement procedure (Smith and Nichols, 2009) to identify significant effects across time. Multiple-comparison correction was based on a sign-permutation test (with null distributions created from 10,000 bootstrapping iterations) as implemented in CoSMoMVPA (Oosterhof et al., 2016). The resulting statistical maps were thresholded at Z > 1.64 (i.e., p<0.05, one-tailed against zero). Additionally, we report the results of one-sided t-tests for all peaks effects. To estimate the reliability of onset and peak latencies we performed bootstrapping analyses, which are reported in Supplementary Items 2/3.

Data availability

Request a detailed protocol

Data are publicly available on OSF (DOI.ORG/10.17605/OSF.IO/H3G6V).

References

  1. 1
  2. 2
    The proactive brain: memory for predictions
    1. M Bar
    (2009)
    Philosophical Transactions of the Royal Society B: Biological Sciences 364:1235–1243.
    https://doi.org/10.1098/rstb.2008.0310
  3. 3
    Remembering: A Study in Experimental and Social Psychology
    1. FC Barlett
    (1932)
    Cambridge University Press.
  4. 4
    The time course of spatial attention during naturalistic visual search
    1. E Battistoni
    2. D Kaiser
    3. C Hickey
    4. MV Peelen
    (2018)
    Cortex, 10.1016/j.cortex.2018.11.018, 30563703.
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
    Scene Vision
    1. RA Epstein
    (2014)
    Neural systems for visual scene recognition, Scene Vision, MIT Press.
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
    Deep residual learning for image recognition
    1. K He
    2. X Zhang
    3. S Ren
    4. J Sun
    (2016)
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778.
    https://doi.org/10.1109/cvpr.2016.90
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
    Kritik Der Reinen Vernunf
    1. I Kant
    (1781)
    Johann Friedrich Hartknoch.
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
    Stories, Scripts and Scenes: Aspects of Schema Theory
    1. JM Mandler
    (1984)
    Taylor & Francis.
  46. 46
    Some of the thousand words a picture is worth
    1. JM Mandler
    2. NS Johnson
    (1976)
    Journal of Experimental Psychology: Human Learning and Memory 2:529–540.
    https://doi.org/10.1037/0278-7393.2.5.529
  47. 47
  48. 48
    The Psychology of Computer Vision
    1. M Minsky
    (1975)
    A framework for representing knowledge, The Psychology of Computer Vision, McGraw-Hill.
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
    The Language and Thought of the Child
    1. J Piaget
    (1926)
    C Keagan Paul, editors. Trench Trubner & Co.
  54. 54
  55. 55
  56. 56
    Theoretical Issues in Reading Comprehension
    1. DE Rumelhart
    (1980)
    Schemata: the building blocks of cognition, Theoretical Issues in Reading Comprehension, CRC press.
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
    MatConvNet – convolutional neural networks for Matlab
    1. A Vedaldi
    2. K Lenc
    (2015)
    ACM International Conference on Multimedia.
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66

Decision letter

  1. Huan Luo
    Reviewing Editor; Peking University, China
  2. Joshua I Gold
    Senior Editor; University of Pennsylvania, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "A neural mechanism for contextualizing fragmented inputs during naturalistic vision" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Joshua Gold as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

This work combined fMRI, EEG and deep neural network model to investigate the neural basis of encoding of fragment location information (e.g., vertical, horizontal positions) across different scene categories. It addresses an important question about how the 'abstract' spatial layout that is not explicitly presented in the stimulus itself is represented in neural activities. The paper is clearly written and the approach and results are interesting and novel. However, there are some major issues brought up by the reviewers that need the authors to address and do additional analysis.

For a revision to be successful, you must address the following major issues:

Essential revisions:

1) The thirty-six figure segments come from only six natural images (3 indoor and 3 outdoor), which means the figure fragments would be repeatedly presented and be learned or memorized gradually. It is therefore hard to distinguish two interpretations – do the results reflect a true representation of spatial layout knowledge that would be automatically formed and could generalize to any natural images or do they derive from a learning and familiarization process after repeated exposure? A possible way to assess this is to divide the data into various stages and compare the early- and late-stage results. If the spatial layout knowledge is automatically represented regardless of learning, we would expect to see the same results in the early part. On the other hand, if it is indeed learning or memory process that induces the results, we would expect to see the pattern only in the late part but not in the early part.

2) Each scene picture was split into two halves horizontally and three parts vertically. Thus, there are confounding factors with regards to why the effect only occurred for vertical locations but not for horizontal locations. The authors should either collect new data or perform new analysis to address the issue.

3) The authors used DNN regression to confirm that the vertical position effect is not due to category-related information. However, the involvement of low-level features in discriminating vertical locations is still quite possible and could not be completely ruled out from the current analysis. For example, image segments at different vertical locations of natural scenes (upper, middle, lower) seem to be also associated with different low-level features (e.g., low spatial frequency for upper part, such as sky or ceiling, etc.). The authors could add additional analysis to clarify the confounds, for example, by creating a low-level dissimilarity design matrix, which would then predict involvement of V1 for the low-level features but not for vertical location, while the reverse for OPA.

4) It is hard to understand what exactly DNN features were removed. ResNet50 and AlexNet were used widely but these DNN models were trained by a very large set of images, whereas the present study only compares 6 specific images. The specific features to differentiate the 6 specific images may not be the same as the removed DNN features. The authors could show reconstructed images with DNN features removed and it is quite possible that human observes would still differentiate the 6 reconstructed images even when the DNN features are removed.

5) It is difficult to figure out what the authors were arguing was the mechanism or the consequence: Does knowledge/schema help sort incomplete information, or does the brain sort incomplete information so that we can extract knowledge? The paper is motivated by the former ("…the brain uses prior knowledge about where information typically appears in a scene to meaningfully sort incoming information") but then ends by stating the latter ("This mechanism empowers the visual brain to efficiently extract meaning from dynamic real-world environments, where it is confronted with sequences of incomplete visual snapshots"). Please add clarifications.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "A neural mechanism for contextualizing fragmented inputs during naturalistic vision" for further consideration at eLife.

The manuscript has been improved but there was still one issue that was misunderstood or not clearly addressed.

In the original third comment, the reviewer pointed out that the DNN regression analysis cannot convincingly rule out the involvement of low-level properties in vertical position effect and suggested to directly test the DNN features in V1 ("The authors could add additional analysis to clarify the confounds, for example, by creating a low-level dissimilarity design matrix, which would then predict involvement of V1 for the low-level features but not for vertical location, while the reverse for OPA."). The authors have performed additional analysis by using 3 low-level control models, but again just examined whether the vertical position effect existed after regression using these models. The results indeed are not quite convincing and it seems that the regression even could not disrupt category representations as the previous DNN model showed. To explicitly address the concern, as the reviewer originally suggested, the authors should perform a new analysis to show that the DNN model does account for low-level property representation in V1, which would then support the claim that DNN regression could remove the low-level features.

https://doi.org/10.7554/eLife.48182.026

Author response

Essential revisions:

1) The thirty-six figure segments come from only six natural images (3 indoor and 3 outdoor), which means the figure fragments would be repeatedly presented and be learned or memorized gradually. It is therefore hard to distinguish two interpretations – do the results reflect a true representation of spatial layout knowledge that would be automatically formed and could generalize to any natural images or do they derive from a learning and familiarization process after repeated exposure? A possible way to assess this is to divide the data into various stages and compare the early- and late-stage results. If the spatial layout knowledge is automatically represented regardless of learning, we would expect to see the same results in the early part. On the other hand, if it is indeed learning or memory process that induces the results, we would expect to see the pattern only in the late part but not in the early part.

This is an important point. We deliberately chose few scene fragments for the experiment to compute reliable neural RDMs from multiple image presentations. We are aware of this choice posing a limitation to the current study and we now explicitly acknowledge this limitation in the Discussion section:

“As the current study is limited to a small set of scenes, more research is needed to explore whether schema-based coding generalizes to more diverse contents.”

Given the limited number of stimuli, it is in principle possible that the fragments’ location-specific representations only emerged after many stimulus repetitions across the experiment. To exclude this possibility, we performed the suggested analysis and separately analyzed data from the first and second halves of both experiments. For these halves (fMRI: first three versus last three runs, EEG: first versus second half of trials), we re-performed the main analysis for these halves. Critically, we found a very similar pattern of results with no statistical differences between the first and second half of each experiment, suggesting that the effect cannot be explained by excessive learning during the experiment. The results of this analysis are reported in Figure 2—figure supplement 6.

2) Each scene picture was split into two halves horizontally and three parts vertically. Thus, there are confounding factors with regards to why the effect only occurred for vertical locations but not for horizontal locations. The authors should either collect new data or perform new analysis to address the issue.

We now performed analyses where we analyzed all pairwise comparisons along the vertical axis (i.e., top versus bottom, top versus middle, and middle versus bottom), so that in each analysis there were only fragments from two different vertical locations. These analyses reveal a consistent vertical location organization, replicating the overall effect for each pairwise comparison, while suggesting that some comparisons (the ones including the top fragments) may contribute more to the effect. Results from these analyses are reported in Figure 2—figure supplement 7.

3) The authors used DNN regression to confirm that the vertical position effect is not due to category-related information. However, the involvement of low-level features in discriminating vertical locations is still quite possible and could not be completely ruled out from the current analysis. For example, image segments at different vertical locations of natural scenes (upper, middle, lower) seem to be also associated with different low-level features (e.g., low spatial frequency for upper part, such as sky or ceiling, etc.). The authors could add additional analysis to clarify the confounds, for example, by creating a low-level dissimilarity design matrix, which would then predict involvement of V1 for the low-level features but not for vertical location, while the reverse for OPA.

Thank you for this suggestion. Although regressing out deep neural networks should also control for low-level features to a substantial degree (previous research has shown that early DNN layers correspond well with early visual processing and activations in V1; see Cichy et al., 2016; Güclü and van Gerven, 2015), we now added new analyses where we used three additional models that explicitly control for low-level features: a pixel dissimilarity model, GIST descriptors (Oliva and Torralba, 2001), and V1 dissimilarity (as a neural approximation of low-level features). We re-performed our regression analyses after regressing out RDMs obtained from each of these low-level models. These analyses show that also the low-level models could not explain the fragments’ vertical location organization; neither could they explain their categorical organization. This indicates that schematic coding cannot be accounted for by low-level features. The new control analyses are summarized in Figure 3—figure supplement 2.

4) It is hard to understand what exactly DNN features were removed. ResNet50 and AlexNet were used widely but these DNN models were trained by a very large set of images, whereas the present study only compares 6 specific images. The specific features to differentiate the 6 specific images may not be the same as the removed DNN features. The authors could show reconstructed images with DNN features removed and it is quite possible that human observes would still differentiate the 6 reconstructed images even when the DNN features are removed.

This is an interesting point. First, we would like to point out that the reason we used categorization DNNs here was primarily to have a good measure of features that are routinely extracted for successful categorization, which includes low-level images properties as well as high-level category-defining features. It is true that the DNNs used here were trained to extract these features from large image databases, of whose properties the few images used in the study only capture a small fraction. However, it is worth noting that this qualitatively resembles processing in the visual system: The types of features the visual system routinely extracts are defined by excessive experience with large varieties of inputs, but at any single moment, we only have a very limited amount of input from which we extract a subset of these features. We now stress the reasons for using pre-trained DNNs in the Materials and methods section:

“DNNs are similar to the brain as they are trained using excessive training material while dynamically adjusting the “tuning” of their connections. […] By comparing DNNs activations and brain responses to the scene fragments, we could quantify to which extent features routinely extracted for categorization purposes account for schema-based coding in the human visual system.”

Second, we would certainly predict that after removing DNN features from the images, participants would still be able to visually discriminate between the images. We conceive of this as a feature rather than as a problem – ultimately the location information observed in brain responses must stem from visual features of the images. The key finding here is that the feature organization in DNN models, despite their similarity to visual cortex representations, cannot account for the schematic coding observed in the brain. We now bring up this point in the Discussion:

“While DNNs trained on categorization accurately capture the representational divide into different scene categories, they cannot explain the schema-based organization observed in the human visual system. Although this does not mean that visual features extracted by DNN models in principle are incapable of explaining schema-based brain representations, our results highlight that current DNN models of categorization do not use real-world structure in similar ways as the human brain.”

Finally, we agree that the current analysis does not provide information on the exact features accounted for by the DNN. Reconstruction images after controlling for DNN features could indeed provide a useful avenue for delineating the features that are uniquely extracted by the DNN and scene-selective cortex, respectively. However, we think that this is beyond the scope of the current paper, because reliably defining these features would ultimately require a larger, more diverse set of images and the acquisition of new experimental data.

5) It is difficult to figure out what the authors were arguing was the mechanism or the consequence: Does knowledge/schema help sort incomplete information, or does the brain sort incomplete information so that we can extract knowledge? The paper is motivated by the former ("…the brain uses prior knowledge about where information typically appears in a scene to meaningfully sort incoming information") but then ends by stating the latter ("This mechanism empowers the visual brain to efficiently extract meaning from dynamic real-world environments, where it is confronted with sequences of incomplete visual snapshots"). Please add clarifications.

The ending did indeed not reflect our interpretation very clearly. We re-worded the concluding paragraph to make it more consistent with the Introduction:

“To conclude, our findings provide the first spatiotemporal characterization of a neural mechanism for contextualizing fragmented visual inputs. By rapidly organizing visual information according to its typical role in the world, this mechanism may contribute to the optimal use of perceptual information for guiding efficient real-world behaviors, even when sensory inputs are incomplete or dynamically changing.”

[Editors' note: further revisions were requested prior to acceptance, as described below.]

The manuscript has been improved but there was still one issue that was misunderstood or not clearly addressed.

In the original third comment, the reviewer pointed out that the DNN regression analysis cannot convincingly rule out the involvement of low-level properties in vertical position effect and suggested to directly test the DNN features in V1 ("The authors could add additional analysis to clarify the confounds, for example, by creating a low-level dissimilarity design matrix, which would then predict involvement of V1 for the low-level features but not for vertical location, while the reverse for OPA."). The authors have performed additional analysis by using 3 low-level control models, but again just examined whether the vertical position effect existed after regression using these models. The results indeed are not quite convincing and it seems that the regression even could not disrupt category representations as the previous DNN model showed. To explicitly address the concern, as the reviewer originally suggested, the authors should perform a new analysis to show that the DNN model does account for low-level property representation in V1, which would then support the claim that DNN regression could remove the low-level features.

Thanks for giving us the opportunity to address the remaining comment in more detail. If we understand correctly, the main worry is that the DNN models used in our study are not good models of low-level feature coding in cortex (i.e., they so not match well with V1-level representations), and thus regressing out DNN features is not a convincing way of ruling out low-level features as the explanation behind the main effect of interest, i.e. the vertical location effect.

We strongly agree that controlling for low-level features is an important issue in the current study. We see that in the previous revision we did not make sufficiently clear the comprehensive way in which we control for the effect of low-level features, DNN feature regression being only one of them. We therefore expose our perspective in detail below and made changes to the manuscript that make our argumentation more explicit.

Our pieces of evidence against low-level features behind the main effect of interest are the following. Most directly, we show that cortical area V1 does not exhibit vertical location effects. The absence of the effect of interest in the core region in cortex coding for low-level features is often taken as strong evidence against an involvement of low-level features. This thinking applies in our study as well. We make this point explicit in the current version of the manuscript:

“To efficiently support vision in dynamic natural environments, schematic coding needs to be flexible with respect to visual properties of specific scenes. The absence of vertical location effects in V1 indeed highlights that schematic coding is not tied to the analysis of simple visual features.”

Naturally, we agree that more comprehensive and corroborative evidence is desirable. Therefore, we go beyond this single piece of evidence with a series of control analyses.

First, we show that only taking visually very different indoor and outdoor scenes into account does not remove vertical location effect. Thus, it is unlikely that the vertical location effect simply reflects low-level features.

Second, we remove DNN features (or features extracted by low-level control models such as the GIST descriptor) and find that this does not abolish the vertical location effects. The reviewer questions in particular whether our DNN model accurately captures low-level feature representations. We believe that it is a fair model – we show that the DNN does explain variance in all regions examined, including V1 (Figure 3—figure supplement 2).

The reviewer further notes that the V1 data (category and vertical location effects) look somewhat comparable before and after regressing out the control models and takes this as suggesting that these are bad models of V1. However, given that we show that the DNN does explain variance in V1, this is also parsimoniously explained by there being no meaningful category or location organization in the V1 data to begin with.

That being said, we fully agree with the reviewer that although the DNN used might be a fair model of V1-level features, it is certainly not the best possible model. Thus, one could still assume that if DNNs approximated low-level feature representations in V1 more faithfully (with more variance explained), removing these DNNs' features might abolish the vertical location organization in higher-level cortex (i.e., the OPA), too.

To exclude this possibility (i.e., that it's simply a very good match to low-level feature representations in V1 that's needed to remove the OPA organization), we conducted a third control analysis. In this analysis, we used the best possible match to the neural V1 data as a low-level model – the V1 data itself! Naturally, the empirical neural V1 data explains the data in V1 (Figure 3—figure supplement 3). Crucially, after removing the neural organization in V1, the vertical location effect in OPA still persisted, showing that even if we had a model that predicted V1 exceptionally well (as well as V1 predicts itself), this model could not account for the vertical location effect.

We believe that this analysis already offers both properties that the reviewer asks for: a very good model of V1-level representation (arguably the best possible match to the neural organization in V1), and a strong vertical location effect in higher-level regions when the model is regressed out.

Together, we thus believe that the remaining open question boils down to how well DNNs can, in principle, model V1 organization. This is an interesting topic, which surely needs investigation – but from our point of view without immediate implications for the current paper. After all, we empirically show that even if the DNN perfectly captured the V1 organization, our conclusions would hold.

We laid out this argument in the revised manuscript:

“DNN features are a useful control for flexibility towards visual features, because they cover both low-level and high-level visual features, explaining variance across fMRI regions and across EEG processing time (see Figure 3—figure supplement 2; see also Cichy et al., 2016; Gücli and van Gerven, 2015). […] Together, these results provide converging evidence that low-level feature processing cannot explain the schematic coding effects reported here.”

https://doi.org/10.7554/eLife.48182.027

Article and author information

Author details

  1. Daniel Kaiser

    1. Department of Psychology, University of York, York, United Kingdom
    2. Department of Education and Psychology, Freie Universität Berlin, Berlin, Germany
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    danielkaiser.net@gmail.com
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-9007-3160
  2. Jacopo Turini

    1. Department of Education and Psychology, Freie Universität Berlin, Berlin, Germany
    2. Institute of Psychology, Goethe-Universität Frankfurt, Frankfurt am Main, Germany
    Contribution
    Formal analysis, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
  3. Radoslaw M Cichy

    1. Department of Education and Psychology, Freie Universität Berlin, Berlin, Germany
    2. Berlin School of Mind and Brain, Humboldt-Universität Berlin, Berlin, Germany
    3. Bernstein Center for Computational Neuroscience Berlin, Berlin, Germany
    Contribution
    Conceptualization, Supervision, Funding acquisition, Project administration, Writing—review and editing
    Competing interests
    No competing interests declared

Funding

Deutsche Forschungsgemeinschaft (KA4683/2-1)

  • Daniel Kaiser

Deutsche Forschungsgemeinschaft (CI241/1-1)

  • Radoslaw M Cichy

Deutsche Forschungsgemeinschaft (CI241/3-1)

  • Radoslaw M Cichy

H2020 European Research Council (ERC-2018-StG 803370)

  • Radoslaw M Cichy

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

DK and RMC are supported by Deutsche Forschungsgemeinschaft (DFG) grants (KA4683/2-1, CI241/1-1, CI241/3-1). RMC is supported by a European Research Council Starting Grant (ERC-2018-StG).

Ethics

Human subjects: All participants provided informed written consent. All procedures were approved by the ethical committee of the Department of Education and Psychology at Freie Universität Berlin (reference 140/2017) and were in accordance with the Declaration of Helsinki.

Senior Editor

  1. Joshua I Gold, University of Pennsylvania, United States

Reviewing Editor

  1. Huan Luo, Peking University, China

Publication history

  1. Received: May 3, 2019
  2. Accepted: October 8, 2019
  3. Accepted Manuscript published: October 9, 2019 (version 1)
  4. Version of Record published: October 21, 2019 (version 2)

Copyright

© 2019, Kaiser et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 796
    Page views
  • 155
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)