
Neural dynamics of visual ambiguity resolution by perceptual prior

  1. Matthew W Flounders
  2. Carlos González-García
  3. Richard Hardstone
  4. Biyu J He (corresponding author)
  1. New York University Langone Medical Center, United States
  2. Ghent University, Belgium
Research Article
Cite this article as: eLife 2019;8:e41861 doi: 10.7554/eLife.41861

Abstract

Past experiences have enormous power in shaping our daily perception. Currently, dynamical neural mechanisms underlying this process remain mysterious. Exploiting a dramatic visual phenomenon, where a single experience of viewing a clear image allows instant recognition of a related degraded image, we investigated this question using MEG and 7 Tesla fMRI in humans. We observed that following the acquisition of perceptual priors, different degraded images are represented much more distinctly in neural dynamics starting from ~500 ms after stimulus onset. Content-specific neural activity related to stimulus-feature processing dominated within 300 ms after stimulus onset, while content-specific neural activity related to recognition processing dominated from 500 ms onward. Model-driven MEG-fMRI data fusion revealed the spatiotemporal evolution of neural activities involved in stimulus, attentional, and recognition processing. Together, these findings shed light on how experience shapes perceptual processing across space and time in the brain.

https://doi.org/10.7554/eLife.41861.001

Introduction

Perception reflects not only immediate patterns of sensory inputs but also memories acquired through prior experiences with the world (Helmholtz, 1924; Albright, 2012). For instance, reading handwriting greatly depends on our existing knowledge of vocabulary and grammar. Stored representations from previous experiences provide likely interpretations of sensory data about their cause and meaning, overcoming the ever-present noise, ambiguity and incompleteness of the retinal image. However, to date, the neural mechanisms underlying the influence of prior experience on visual perception remain largely unknown.

Here, we adopted the Mooney image paradigm to investigate visual perception of identical sensory input that results in distinct perceptual outcomes depending on whether or not prior knowledge is present. These Mooney images are degraded, two-tone images created from natural photographs of objects and animals. Even after multiple presentations, the content of these images typically remains unrecognizable. However, once exposed to the corresponding non-degraded original photograph, subjects effortlessly recognize the Mooney image in future presentations – a disambiguation effect that lasts for days, months, even a lifetime (Ludmer et al., 2011; Albright, 2012). This phenomenon illustrates that a prior that guides perception can be established in a remarkably fast and robust manner. Thus, Mooney images offer an experimentally controlled paradigm for dissecting how prior experience shapes perceptual processing.

Previous neuroimaging studies have observed that disambiguation of Mooney images induces widespread activation and enhanced image-specific information in both visual and frontoparietal cortices (Dolan et al., 1997; Hegdé and Kersten, 2010; Hsieh et al., 2010; Gorlin et al., 2012; van Loon et al., 2016; González-García et al., 2018). However, due to the low temporal resolution of these techniques (positron emission tomography and fMRI), the temporal dynamics underlying this effect remain unknown. This is an important open question because behavioral studies have shown that recognition of Mooney images, even after disambiguation, is slow – with reaction times at around 1.2 s (Hegdé and Kersten, 2010). By contrast, neural dynamics underlying recognition of intact, unambiguous images, as well as scene-facilitation of object recognition, typically conclude within 500 ms (Carlson et al., 2013; van de Nieuwenhuijzen et al., 2013; Kaiser et al., 2016; Brandman and Peelen, 2017). Together with a recent finding of altered content-specific neural representations in frontoparietal regions following Mooney image disambiguation (González-García et al., 2018), these observations raise the intriguing possibility that slow (taking longer than 500 ms), long-distance recurrent neural dynamics involving large-scale brain networks are necessary for prior-experience-guided visual recognition.

Several previous EEG and MEG studies reported a disambiguation-induced decrease in beta-band power and an increase in gamma-band power (Grützner et al., 2010; Minami et al., 2014; Moratti et al., 2014), but these effects could potentially be attributed to non-content-specific effects such as increased attention, salience or decreased task difficulty following disambiguation. To unravel the neural mechanisms underlying the influence of prior experience on perception, an important unanswered question is how different information processing stages are dynamically encoded in neural activities.

Here, we probe the dynamical encoding of perceptual state (before or after disambiguation) as well as the physical features and recognition outcomes related to individual images in neural activities, using multivariate pattern decoding and representational similarity analysis (RSA) applied to whole-head MEG data. In addition, to illuminate the anatomical distribution of the evolving neural dynamics involved in different information processing stages, we applied model-driven cross-modal RSA (Hebart et al., 2018) to combine the high temporal resolution of MEG data with high spatial resolution of 7T fMRI data collected using a similar paradigm. These approaches allowed us to spatiotemporally resolve neural dynamics underlying different stages of information processing during prior-guided visual perception.

Results

Paradigm and behavioral results

Eighteen subjects were shown 33 Mooney images containing animals or manmade objects. Each Mooney image was presented six times before its corresponding grayscale image was shown to the subject, and six times after. Following each Mooney image presentation, subjects reported whether they could recognize the image using a button press (‘subjective recognition’). Each MEG ‘run’ (Figure 1; for details see Materials and methods, Task paradigm) included three different grayscale images, their corresponding post-disambiguation Mooney images, and three new Mooney images shown pre-disambiguation (their corresponding grayscale images would be shown in the next run). To ensure that subjects’ self-reported recognition matched the true content of the Mooney images, at the end of each run, Mooney images presented during that run were shown again and participants were asked to verbally report what they saw in the image and were allowed to answer ‘unknown’. This resulted in a verbal test for each Mooney image once before disambiguation and once after disambiguation (‘verbal identification’). Verbal responses were scored as correct or incorrect using a pre-determined list of acceptable responses for each image. In addition, six Mooney images were presented with non-matching grayscale images using identical block and run structure, which served as controls for the effect of repetition (‘catch image sets’, as opposed to the 33 ‘real image sets’ described earlier).

Task paradigm.

(A) Trial structure. The left/right position (and corresponding button) for ‘Yes’/‘No’ answer was randomized from trial to trial. (B) Block and run structure. Each block includes 15 trials: three grayscale images, six Mooney images in a randomized order, then a repeat of these six Mooney images in a randomized order. Three of the six Mooney images correspond to the grayscale images presented in the same run and are presented post-disambiguation. The other three Mooney images are presented pre-disambiguation, and their matching grayscale images will be shown in the following run. An experimental run consists of a block presented three times with randomized image order, followed by a verbal test (for details, see Materials and methods). Mooney images were not presented to subjects with colored frames.

https://doi.org/10.7554/eLife.41861.002

Viewing the corresponding grayscale image had a dramatic effect on Mooney image recognition as shown by the following results. First, we compared recognition rates between pre- and post-disambiguation stages using a two-way repeated-measures ANOVA [factors: presentation stage and image-set type (real vs. catch)]. This analysis was carried out separately using subjective recognition rates (pooled across six presentations) and correct verbal identification rates as the dependent variable (Figure 2A). There was a significant main effect of presentation stage (pre- vs. post-disambiguation) on both subjective recognition (Figure 2A, left, F(1,68) = 42.6, p = 1.0 × 10^−8, ηp² = 0.38) and verbal identification (Figure 2A, right, F(1,68) = 60.7, p = 5.2 × 10^−11, ηp² = 0.47). Crucially, there was also a significant interaction effect of presentation stage × image-set type on both subjective recognition (F(1,68) = 16.4, p = 1.0 × 10^−4, ηp² = 0.19) and verbal identification (F(1,68) = 38.5, p = 3.7 × 10^−8, ηp² = 0.36), suggesting that the effect of disambiguation by viewing the corresponding grayscale image significantly exceeds that induced by repetition.
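As a quick arithmetic check, the effect sizes reported above can be compared with the standard identity that recovers partial eta squared from an F statistic and its degrees of freedom. The sketch below assumes that identity; the text does not state how the effect sizes were computed, and the function name is introduced here only for illustration.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# (F, df_effect, df_error, reported effect size), as quoted in the text
reported = [
    (42.6, 1, 68, 0.38),   # main effect, subjective recognition
    (60.7, 1, 68, 0.47),   # main effect, verbal identification
    (16.4, 1, 68, 0.19),   # interaction, subjective recognition
    (38.5, 1, 68, 0.36),   # interaction, verbal identification
]

for f_value, df1, df2, eta in reported:
    recovered = partial_eta_squared(f_value, df1, df2)
    print(f"F({df1},{df2}) = {f_value}: recovered {recovered:.3f}, reported {eta}")
```

Because the F values are themselves rounded, agreement is expected only to within about 0.01.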

Disambiguation-related behavioral effects.

(A) Subjective recognition rates (left, averaged across six presentations) and correct verbal identification rates (right) for Mooney images presented in the pre- and post- disambiguation period (gold and teal bars). Corresponding results for catch images are shown in open bars. p-Values corresponding to the interaction effect (pre vs. post × catch vs. non-catch) of two-way ANOVAs are shown in the graph. (B) Subjective recognition rates grouped by presentation number in the pre- and post- disambiguation stage (gold and teal), as well as the corresponding rates for catch images (white lines). (C) Distribution of subjective recognition rates across Mooney images in the pre- (gold bars, left) and post- (teal bars, right) disambiguation stage. Corresponding distributions for catch images are shown as open bars. All results show mean and s.e.m. across subjects.

https://doi.org/10.7554/eLife.41861.003

Second, given that each Mooney image was presented six times before and six times after disambiguation with subjective recognition probed each time, we assessed the effect of repetition on Mooney image recognition. To this end, we conducted a three-way repeated-measures ANOVA [factors: presentation number (1–6), presentation stage (pre- vs. post-disambiguation), and image-set type (real vs. catch)] on subjective recognition rates (Figure 2B). No main or interaction effect involving presentation number was significant (all p > 0.2). However, consistent with the previous analysis, there was a highly significant interaction effect of presentation stage × image-set type (F(1,408) = 85.2, p = 1.5 × 10^−18, ηp² = 0.17). These results suggest that viewing the corresponding grayscale image, but not repeated viewing of the Mooney image, significantly facilitates Mooney image recognition.

Last, we examined the distribution of subjective recognition rates across individual Mooney images and subjects. The distribution of subjective recognition rates was separately plotted for pre- (Figure 2C, left) and post- (Figure 2C, right) disambiguation images, and for real (colored bars) and catch (empty bars) image sets. A bimodal distribution is observed in both stages. Accordingly, we defined those images recognized two or fewer times as ‘not-recognized’, and those recognized four or more times as ‘recognized’ (rectangles in Figure 2C). Based on these criteria, for real image sets, 60.9 ± 4.2% are not recognized and 34.0 ± 3.4% are recognized in the pre-disambiguation stage, whereas only 10.3 ± 2.4% are not recognized and 87.4 ± 2.5% are recognized in the post-disambiguation stage.

Together, these results show the robustness of the Mooney image disambiguation effect, whereby having seen a related unambiguous image dramatically facilitates recognition of the degraded Mooney image. Importantly, in our paradigm a grayscale image is rarely followed by its corresponding Mooney image in the immediate next trial due to the block structure and the shuffling of image sequence (Figure 1B). This design and the long trial length (on average 8.5 s, see Figure 1A) ensure that the disambiguation effect cannot be driven by low-level priming (also see Chang et al., 2016).

Disambiguation changes neural activity patterns elicited by Mooney images

To dissect the neural dynamics underlying the resolution of visual ambiguity by prior experience, we first investigated the neural activities that distinguish between pre- and post-disambiguation Mooney images regardless of image identity, using time-resolved multivariate decoding applied to whole-brain MEG data. This analysis was carried out using two frequency bands: slow cortical potentials (SCPs, 0.05–5 Hz) that were recently shown to be involved in conscious vision (Li et al., 2014; Baria et al., 2017) and the classic event-related field (ERF, DC – 35 Hz) band that has been used to decode stimulus features under seen and unseen conditions (Salti et al., 2015; King et al., 2016). Using the 33 real image sets, we constructed three classifiers. First, all 33 unique Mooney images were included, and presentation stage (pre- vs. post-disambiguation) was decoded (Figure 3A–B, green). Second, because a small number of Mooney images (34.0 ± 3.4%) were spontaneously recognized pre-disambiguation, and 10.3 ± 2.4% of Mooney images remained unrecognized in the post-disambiguation period, we performed a behaviorally constrained analysis to more precisely target the disambiguation effect: for each subject, Mooney images that were both not-recognized (/not-identified) pre-disambiguation and recognized (/identified) post-disambiguation were selected. This resulted in 16.0 ± 1.4 (mean ± s.e.m. across subjects) real image sets based on subjective recognition (or 17.5 ± 1.0 real image sets based on verbal identification). Using these image sets, the presentation stage (pre- vs. post-disambiguation) was decoded (Figure 3A–B, blue and magenta). In a control analysis, the presentation stage of the catch image sets was also decoded, regardless of behavioral responses (Figure 3A–B, black traces).
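The logic of time-resolved decoding can be illustrated with a minimal sketch on synthetic data. This is not the pipeline used in the study: the nearest-class-mean classifier, the five-fold cross-validation, the array sizes, and all function names are assumptions made here for illustration. A separate classifier is trained and tested at each time point, so the output is an accuracy time course.

```python
import numpy as np

def nearest_mean_predict(train_x, train_y, test_x):
    """Assign each test pattern to the class whose mean training pattern is closest."""
    classes = np.unique(train_y)
    means = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_x[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def time_resolved_decoding(data, labels, n_folds=5, seed=0):
    """data: (n_trials, n_sensors, n_times). Returns cross-validated accuracy per time point."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(labels)), n_folds)
    accuracy = np.zeros(data.shape[2])
    for t in range(data.shape[2]):
        correct = 0
        for test_idx in folds:
            train_mask = np.ones(len(labels), dtype=bool)
            train_mask[test_idx] = False
            pred = nearest_mean_predict(
                data[train_mask, :, t], labels[train_mask], data[test_idx, :, t]
            )
            correct += np.sum(pred == labels[test_idx])
        accuracy[t] = correct / len(labels)
    return accuracy

# Synthetic data: label 0 = "pre", label 1 = "post"; the two classes differ
# only from time index 5 onward (hypothetical sizes, not the study's).
rng = np.random.default_rng(1)
n_trials, n_sensors, n_times = 120, 20, 10
labels = np.repeat([0, 1], n_trials // 2)
data = rng.normal(size=(n_trials, n_sensors, n_times))
data[labels == 1, :, 5:] += 1.0
acc = time_resolved_decoding(data, labels)
print(np.round(acc, 2))
```

In this toy example accuracy stays near chance (0.5) before the class difference appears and rises afterward, which is the pattern a significance test such as the cluster-based permutation test would then evaluate.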

Decoding perceptual state information.

(A) Decoding accuracy using SCP (0.05–5 Hz) activity across all sensors. Classifiers were constructed to decode i) presentation stage for all 33 Mooney images (Pre vs. Post; green); ii) presentation stage using only Mooney images that are not recognized pre-disambiguation and recognized post-disambiguation (Disambiguation, subjective recognition; blue); iii) presentation stage using only Mooney images that are not identified pre-disambiguation and identified post-disambiguation (Disambiguation, verbal identification; magenta); iv) ‘presentation stage’ for catch images, where the grayscale images did not match the Mooney images (black). Shaded areas represent s.e.m. across subjects. Horizontal bars indicate significant time points (p<0.05, cluster-based permutation tests). (B) Same as A, but for ERF (DC – 35 Hz) activity. (C) Activation patterns transformed from decoder weight vectors of the ‘Disambiguation, subjective recognition’ decoder constructed using SCP (top row) and ERF (bottom row) activity, respectively. (D) Left: TGM for the ‘Disambiguation, subjective recognition’ decoder constructed using SCP. Significance is outlined by the dotted black trace. Right: Cross-time generalization accuracy for classifiers trained at five different time points (marked as red vertical lines; blue traces, corresponding to rows in the TGM). The within-time decoding accuracy (corresponding to the diagonal of the TGM and blue trace in A) is plotted in black for comparison. Solid traces show significant decoding (p<0.05, cluster-based permutation test); shaded areas denote significant differences between within- and across- time decoding (p<0.05, FDR corrected). The black vertical bars in A, B, and D denote onset (0 s) and offset (2 s) of image presentation.

https://doi.org/10.7554/eLife.41861.004

With SCP activity (Figure 3A), the classifier using real image sets performed significantly above chance level at every time point from 300 ms onward (green trace and bar, p<0.05, cluster-based permutation test). Classifiers based on behavioral performance performed similarly (blue and magenta). By contrast, the classifier applied to catch image sets yielded chance-level accuracy throughout the trial epoch (black). A control analysis showed that decoding of presentation stage based on six randomly selected real image sets to match the statistical power of catch image sets still yielded sustained significant decoding from 300 ms onward (figure not shown). All the results are qualitatively similar but, as expected, with more high-frequency fluctuations, when the classifiers were constructed using the ERF band (Figure 3B).

In order to shed light on the neural activity contributing to successful decoding, we performed activation pattern transform on the classifier weights (Haufe et al., 2014). Activation patterns corresponding to the SCP and ERF classifiers based on subjective recognition responses (Figure 3A–B, blue traces) are plotted in Figure 3C, which reveal strong frontotemporal contributions starting within 400 ms after stimulus onset (note that both large positive and large negative values contribute strongly to the classifier). Later, we will probe the spatiotemporal evolution of relevant neural activities more precisely using model-driven MEG-fMRI data fusion.

Cross-time generalization of decoding perceptual states

The above decoding analysis assesses whether there is separable neural activity pattern information between perceptual states at each time point following stimulus onset. To understand how this pattern information evolves over time, we next investigated decoder cross-time generalization (King and Dehaene, 2014). Here, a classifier trained at a given time point is tested on other time points in the trial epoch, and the extent to which it generalizes to another time point reveals how similar or different the underlying neural activity patterns are between the two time points. The temporal generalization matrix (TGM) corresponding to the subjective recognition classifier constructed with the SCP band (Figure 3A, blue trace) is shown in Figure 3D, left panel. Results obtained using the ‘Pre vs. Post’ classifier and the verbal identification classifier, or using the ERF band, were qualitatively similar (not shown). Significant generalization (p<0.05, cluster-based permutation test, outlined by the dashed trace) was obtained in a large temporal cluster that includes a wide diagonal beginning at ~300 ms, and a square shape from ~800 ms until the end of the trial epoch. The cross-time generalization accuracies of five classifiers trained at 400 ms intervals are shown as blue traces in Figure 3D, right panel, as compared with the within-time decoding accuracy, shown as black traces. For the classifier trained at 400 ms (bottom row), its cross-time generalization time course departs significantly from the within-time decoding time course (shown as shaded areas, p<0.05, paired t-tests, FDR corrected). By contrast, for the classifiers trained at 800 ms or later, their cross-time generalization accuracies closely tracked the within-time decoding accuracy. Together, these results suggest evolving patterns of neural activity until ~800 ms after stimulus onset and thereafter sustained activity patterns that distinguish pre- and post-disambiguation perceptual states.
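Cross-time generalization can be sketched in the same spirit: a classifier is trained at one time point and tested at every time point, which fills one row of the TGM. The example below is a simplified illustration on synthetic data, assuming a nearest-class-mean classifier and a separate held-out test set; the sizes, the simulated patterns, and the function names are hypothetical.

```python
import numpy as np

def temporal_generalization(train, train_y, test, test_y):
    """Rows index training time, columns index testing time.

    train, test: (n_trials, n_sensors, n_times); labels are 0 or 1.
    """
    n_times = train.shape[2]
    mean0 = train[train_y == 0].mean(axis=0)  # (n_sensors, n_times)
    mean1 = train[train_y == 1].mean(axis=0)
    tgm = np.zeros((n_times, n_times))
    for t_train in range(n_times):
        # A nearest-class-mean rule is a linear rule with these weights and threshold.
        w = mean1[:, t_train] - mean0[:, t_train]
        b = w @ (mean1[:, t_train] + mean0[:, t_train]) / 2
        for t_test in range(n_times):
            pred = (test[:, :, t_test] @ w > b).astype(int)
            tgm[t_train, t_test] = np.mean(pred == test_y)
    return tgm

def simulate(rng, n_trials=200, n_sensors=30, n_times=10):
    """Class 1 carries one pattern at times 2-4 and a different, sustained pattern at times 5-9."""
    y = np.repeat([0, 1], n_trials // 2)
    x = rng.normal(size=(n_trials, n_sensors, n_times))
    x[y == 1, :15, 2:5] += 1.0   # early pattern on the first 15 sensors
    x[y == 1, 15:, 5:] += 1.0    # late pattern on the remaining sensors
    return x, y

rng = np.random.default_rng(0)
train_x, train_y = simulate(rng)
test_x, test_y = simulate(rng)
tgm = temporal_generalization(train_x, train_y, test_x, test_y)
print(np.round(tgm, 2))
```

A classifier trained during the sustained period generalizes across that whole period (a square in the matrix), whereas a classifier trained on the early pattern fails on the late one, indicating that the underlying activity pattern has changed.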

Disambiguation increases across-image dissimilarity in neural dynamics

The above results show separation in neural activity patterns between Mooney images presented in the pre- vs. post-disambiguation period, when the same physical images elicit distinct perceptual outcomes (meaningless blobs vs. recognizable animals or objects). However, these results do not necessarily suggest image-content-specific processing: successful decoding of perceptual state may also reflect non-content-specific processing such as heightened attention and salience or reduced task difficulty associated with post-disambiguation images. To probe content-specific neural processing, we next turned to RSA (Kriegeskorte et al., 2008a) which allows for individual-image-level comparisons of neural activity patterns. At every time point in the trial epoch, we constructed a representational dissimilarity matrix (RDM), where each element contains the neural dissimilarity (quantified as 1 – Pearson’s r, computed over all sensors, see Materials and methods, RSA) between two individual images (Figure 4A). Neural dissimilarity was calculated both for different images presented in the same condition (e.g., the Pre-Pre square of the RDM), and for (same or different) images presented in different conditions (e.g., the Pre-Post square of the RDM).
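As a minimal sketch of the RDM construction at a single time point, the following computes 1 – Pearson's r between every pair of trial-averaged sensor patterns and then averages the upper triangle of one within-condition square. The data are synthetic, and the sensor count, noise levels, and function names are assumptions for illustration only.

```python
import numpy as np

def correlation_rdm(patterns):
    """patterns: (n_images, n_sensors) trial-averaged activity at one time point."""
    return 1.0 - np.corrcoef(patterns)

def mean_upper_triangle(square):
    """Mean across-image dissimilarity within one condition (diagonal excluded)."""
    i, j = np.triu_indices(square.shape[0], k=1)
    return square[i, j].mean()

rng = np.random.default_rng(0)
n_images, n_sensors = 33, 100            # 33 image sets; sensor count is hypothetical
common = rng.normal(size=n_sensors)       # response component shared by all images

# Image-specific components are made weak for "pre" and strong for "post" and "gray".
pre = common + 0.3 * rng.normal(size=(n_images, n_sensors))
post = common + 1.0 * rng.normal(size=(n_images, n_sensors))
gray = common + 1.0 * rng.normal(size=(n_images, n_sensors))

rdm = correlation_rdm(np.vstack([pre, post, gray]))   # 99 x 99, ordered Pre, Post, Gray
pre_mean = mean_upper_triangle(rdm[:33, :33])          # Pre-Pre square
post_mean = mean_upper_triangle(rdm[33:66, 33:66])     # Post-Post square
print(round(pre_mean, 3), round(post_mean, 3))
```

Repeating this at every time point would yield one dissimilarity time course per condition.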

Image-level representations are influenced by prior information.

(A) Schematic for RSA. See Materials and methods, RSA, for details. (B) Group-average MEG representation dissimilarity matrices (RDMs) at selected time points. (C) Schematics for analyses shown in D and E. (D) Mean across-image representational dissimilarity in each perceptual condition, calculated by averaging the elements in the upper triangles for each condition in the RDM (see C-i). Horizontal bars denote significant differences between conditions (p<0.05, cluster-based permutation tests). (E) Results from the intra-RDM analysis showing time courses of neural activity related to ‘stimulus-based’ and ‘recognition-based’ representation, obtained by performing element-wise correlations between the Pre and Post triangles in the RDM, and between the Post and Gray triangles, respectively (see C-ii). Correlation values were Fisher-z-transformed. Horizontal bars denote significance for each time course (p<0.05, cluster-based permutation tests). Shaded areas in D and E show s.e.m. across subjects.

https://doi.org/10.7554/eLife.41861.005

Group-average RDMs at five different time points are shown in Figure 4B. Two patterns can be seen: First, at 300–600 ms following stimulus input, dissimilarity across all image pairs is relatively low, presumably driven by visual-evoked activity that has a similar gross spatial pattern for all images. Second, after 600 ms, dissimilarity between individual Mooney images in the pre-disambiguation period (Pre-Pre square of the RDM) is visibly lower than between post-disambiguation Mooney images (Post-Post square) or between grayscale images (Gray-Gray square). We quantified these effects by averaging across elements within the upper triangles of the Pre-Pre, Post-Post and Gray-Gray squares of the RDM (Figure 4C–i) and plotting the mean across-image dissimilarity for each perceptual condition across time (Figure 4D). In all perceptual conditions, neural dissimilarity between individual images decreases sharply following stimulus onset and reaches the trough around 400 ms. However, from ~500 ms onward, across-image dissimilarity was substantially lower for pre-disambiguation Mooney images (gold) than for post-disambiguation Mooney images (teal) or grayscale images (black), and this difference is significant from ~1 s to the end of the trial epoch (p<0.05, cluster-based permutation tests). The difference between post-disambiguation Mooney images and gray-images was not significant at any time point. A similar analysis applied to catch image sets did not yield any significant difference between pre- and post-disambiguation stages, or between Mooney and grayscale images (Figure 4—figure supplement 1).

To ensure that the increase of across-image dissimilarity following disambiguation was not driven by increased noise in the data, we re-conducted the analysis shown in Figure 4D using cross-validated Euclidean distance, a distance metric that is unaffected by changing levels of noise and only captures the signal component that is shared between different partitions of the data set (see Materials and methods, Euclidean Distance). The results, shown in Figure 4—figure supplement 2, reproduce the findings in Figure 4D and confirm that the increased across-image dissimilarity following disambiguation was not driven by a changing level of noise between perceptual conditions.

Together, these results suggest that, even though Mooney images presented in the two task stages are physically identical, following the acquisition of perceptual priors, different Mooney images are represented much more distinctly in neural dynamics, in fact as distinctly as the grayscale images.

This dramatic effect raises two important questions: 1) Is this effect driven by the neural representations of Mooney images shifting towards those of their respective grayscale counterparts? 2) Does this effect reflect image-content-specific processing at the single-trial level? To answer these questions, we next probe how neural representation for a particular Mooney image changes following disambiguation at the single-trial level.

Comparing image-specific dynamic neural representations across perceptual conditions at the single-trial level

Here, we assess how disambiguation changes dynamical neural representations by analyzing single-trial separability of neural activity patterns elicited by the same (or matching Mooney-grayscale) image presented in different conditions. To this end, we used a measure akin to single-trial decoding to quantify separability (at the single-trial level) of neural activities elicited by two images (e.g. Images A and B in the same condition or different conditions, or Image A in different conditions). This measure calculates how much neural similarity across multiple presentations of the same image exceeds neural similarity between the two different images (i.e. rwithin − rbetween, Figure 5A–i, for details see Materials and methods, Single-trial separability). Unlike the 1 − r distance measure used in the previous analysis (Figure 4), this metric takes into account the within-image, across-trial variability: for instance, in the two examples given in Figure 5A–ii, the 1 − r distance is identical, but single-trial separability (rwithin − rbetween) is higher in the top example.
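One way to make the rwithin − rbetween idea concrete is sketched below. The exact definition is given in the Materials and methods, which are not reproduced here, so the averaging scheme in this example (mean pairwise correlation across trials of the same image, minus mean correlation between trials of the two images) is an assumption, as are the function names and the synthetic data.

```python
import numpy as np

def separability(trials_a, trials_b):
    """trials_a, trials_b: (n_trials, n_sensors) single-trial patterns for two images."""
    n_a, n_b = len(trials_a), len(trials_b)
    r = np.corrcoef(np.vstack([trials_a, trials_b]))
    within_a = r[:n_a, :n_a][np.triu_indices(n_a, k=1)]   # pairs of different A trials
    within_b = r[n_a:, n_a:][np.triu_indices(n_b, k=1)]   # pairs of different B trials
    r_within = np.concatenate([within_a, within_b]).mean()
    r_between = r[:n_a, n_a:].mean()                       # every A trial vs. every B trial
    return r_within - r_between

rng = np.random.default_rng(0)
n_trials, n_sensors = 6, 100          # six presentations; sensor count is hypothetical
mu_a, mu_b = rng.normal(size=(2, n_sensors))

def simulate(mu, noise_sd):
    """Single trials scattered around a mean pattern."""
    return mu + noise_sd * rng.normal(size=(n_trials, n_sensors))

# Same mean patterns in both cases; only the trial-to-trial variability differs.
tight = separability(simulate(mu_a, 0.5), simulate(mu_b, 0.5))
loose = separability(simulate(mu_a, 2.0), simulate(mu_b, 2.0))
same = separability(simulate(mu_a, 1.0), simulate(mu_a, 1.0))
print(round(tight, 2), round(loose, 2), round(same, 2))
```

With identical mean patterns, lower trial-to-trial variability yields higher separability, and two sets of trials drawn from the same image give a value near zero, the chance level.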

Single-trial separability analysis.

(A) i: Schematic for separability calculation. For details, see Materials and methods, Single-trial separability. ii: Two hypothetical examples of single-trial neural activity patterns (projected to a 2-D plane) for Image A (black dots) and Image B (gray dots). The neural dissimilarity calculated based on trial-averaged activity patterns (1 – r measure used earlier) is identical between the two examples, while single-trial separability (rwithin - rbetween) is higher in the top example. (B) Group-average MEG RDMs computed with the separability (rwithin - rbetween) measure at selected time points. (C) Quantifying separability between neural activity patterns elicited by the same/matching image presented in different conditions. (i) Analysis schematic: diagonal elements in the between-condition squares of the RDM are averaged together, yielding three time-dependent outputs corresponding to the three condition-pairs. (ii) Separability time courses averaged across 33 real image sets for each between-condition comparison, following the color legend shown in C-i. The top three horizontal bars represent significant (p<0.05, cluster-based permutation test) time points of each time course compared to chance level (0); and the bottom three bars represent significance of pairwise comparisons between the time courses. (D) Quantifying the difference between off-diagonal and diagonal elements in the between-condition squares of the separability RDMs. Intuitively, this analysis captures how similar an image is to itself or its matching version presented in a different condition over and above its similarity to other images presented in that same condition. Statistical significance (p<0.05, cluster-based permutation test) for pairwise comparisons are shown as horizontal bars. When compared to chance (0), the three traces are significant from 40 ms (Pre-Post), 50 ms (Pre-Gray), and 60 ms (Post-Gray) onward until after image offset, respectively. 
Traces in C-ii and D-ii show mean and s.e.m. across subjects.

https://doi.org/10.7554/eLife.41861.008

We reconstructed the RDMs using the separability metric calculated across all MEG sensors. As shown in Figure 5B, there are darker diagonals in the between-condition squares: for example, in the Pre-Post square at 100 ms, and in the Post-Gray square at 800 ms. This suggests that single-trial separability is lower between the same/matching image presented across conditions than between different images presented across conditions. Below we evaluate how disambiguation alters neural representation of the same image using two quantitative analyses applied to these RDMs.

First, we extracted the mean of diagonal elements within each between-condition square of the RDM (Figure 5C–i), which quantifies how separable neural activity patterns are between the same Mooney image presented before and after disambiguation (Pre-Post), between a pre-disambiguation Mooney image and its matching grayscale image (Pre-Gray), and between a post-disambiguation Mooney image and its matching grayscale image (Post-Gray). The result, plotted in Figure 5C–ii, shows that pre-disambiguation Mooney images are well separable from their matching grayscale images from 50 ms until the end of trial epoch (orange trace and significance bar, p<0.05, cluster-based permutation test). By contrast, post-disambiguation Mooney images are separable from their matching grayscale images only in a short window – from 50 to 530 ms (Figure 5C–ii, dark blue). Strikingly, neural representation of a post-disambiguation image is entirely indistinguishable from that of its matching grayscale image from ~550 ms onward (as shown by separability fluctuating around chance level; Figure 5C–ii, dark blue), despite the difference in the stimulus input. The separability between the same Mooney image presented before and after disambiguation starts to increase after stimulus onset, and reaches significance at 270 ms, which remains significant until the end of trial epoch (Figure 5C–ii, green). These results suggest that early (<300 ms) neural activity patterns distinguish between different physical stimulus inputs – Mooney vs. grayscale images, while late (>500 ms) neural activity patterns distinguish between recognizing the image content and failing to do so.

Second, we quantified the strength of between-condition diagonals (i.e. separability between the same/matching image presented across conditions) as compared to off-diagonal elements in the same between-condition squares (i.e. separability between different images presented across conditions), by computing the mean for each and calculating the difference between them (Figure 5D–i). This metric captures how similar an image is represented to itself or its matching version presented in a different condition over and above its similarity to other images presented in that same condition. The comparison between the Post-Gray traces in Figure 5C–ii and Figure 5D–ii (dark blue) reveals a striking effect after 550 ms: although neural activity patterns elicited by a post-disambiguation image are indistinguishable from those elicited by its matching grayscale image, they are well separable from other grayscale images, suggesting an image-specific shift in neural representation toward the relevant prior experience that guides perception. The difference between off-diagonal and diagonal separability is larger for the Pre-Post square than the Pre-Gray square from 20 to 520 ms (Figure 5D–ii, magenta), and larger for the Post-Gray square than the Pre-Gray square from 380 ms to 1.3 s (Figure 5D–ii, cyan).
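The two summaries of a between-condition square described above reduce to simple array operations. The sketch below uses a toy matrix with hypothetical values and assumes that rows and columns list the images in the same order, so that the diagonal holds same/matching-image pairs.

```python
import numpy as np

def diagonal_summary(square):
    """square: (n_images, n_images) between-condition block of a separability RDM.

    Returns the mean diagonal separability (same/matching image across conditions)
    and the off-diagonal minus diagonal difference.
    """
    n = square.shape[0]
    diag_mean = np.diag(square).mean()
    off_mean = square[~np.eye(n, dtype=bool)].mean()
    return diag_mean, off_mean - diag_mean

# Toy Post-Gray square: matching images are inseparable (0), other pairs are not (0.3).
n_images = 33
toy = np.full((n_images, n_images), 0.3)
np.fill_diagonal(toy, 0.0)
diag_mean, difference = diagonal_summary(toy)
print(diag_mean, difference)
```

A diagonal mean near zero together with a positive difference corresponds to the Post-Gray result after 550 ms: a post-disambiguation image is indistinguishable from its own grayscale image yet separable from the other grayscale images.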

Together, these results provide strong evidence that early (<300 ms) neural dynamics carry stimulus-content-specific processing and late (>500 ms) neural dynamics carry recognition-content-specific processing.

Temporal separation of stimulus- and recognition-related neural representations

To further shed light on neural mechanisms underlying different information processing stages involved in prior-guided visual disambiguation, we investigated how the representational geometry (i.e. the set of representational distances between image-pairs) compares between perceptual conditions (for details see Materials and methods, Intra-RDM analysis). We reasoned that because the same set of Mooney images are presented in the Pre and Post stages, and they are ordered in the same sequence within the Pre-Pre and Post-Post squares of the RDM, neural activity reflecting stimulus-feature processing should exhibit a similar pattern between Pre-Pre and Post-Post squares of the RDM, and this can be quantified by performing an element-by-element correlation between these two portions of the RDM (Figure 4C–ii, ‘stimulus-based’ representation). Likewise, because the recognition content of a post-disambiguation Mooney image is similar to that of its corresponding grayscale image despite different stimulus input (e.g. ‘it’s a crab!’), neural activity reflecting recognition-content processing should exhibit a similar pattern between the Post-Post and Gray-Gray squares of the RDM, and this effect can be quantified by an element-by-element correlation between these two portions of the RDM (Figure 4C–ii, ‘recognition-based’ representation). For simplicity and ease of interpretation, for this analysis we used RDMs calculated based on trial-averaged activity patterns using the 1 – r measure (Figure 4B).
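
As a rough illustration of this intra-RDM comparison, the sketch below correlates the upper triangles of two within-condition squares of a 1 – r RDM. The Pre, Post, Gray ordering, the use of a Pearson correlation, the function names, and the simulated activity patterns are all assumptions made for the example; the original analysis may differ in these details.

```python
import numpy as np


def within_square(rdm, cond_index, n_images):
    """Return the within-condition square (0 = Pre, 1 = Post, 2 = Gray; assumed order)."""
    start = cond_index * n_images
    return rdm[start:start + n_images, start:start + n_images]


def geometry_correlation(square_a, square_b):
    """Element-by-element correlation between two within-condition squares.

    Only the upper triangle is used, because these squares are symmetric
    and their diagonals are zero by construction.
    """
    upper = np.triu_indices(square_a.shape[0], k=1)
    return float(np.corrcoef(square_a[upper], square_b[upper])[0, 1])


# Simulated trial-averaged sensor patterns at one time point (toy data).
rng = np.random.default_rng(0)
n_images, n_sensors = 6, 50
pre = rng.normal(size=(n_images, n_sensors))
post = pre + 0.01 * rng.normal(size=(n_images, n_sensors))  # same stimulus features
gray = rng.normal(size=(n_images, n_sensors))               # unrelated patterns

# 1 - r dissimilarity between every pair of condition-by-image patterns.
rdm = 1.0 - np.corrcoef(np.vstack([pre, post, gray]))

r_stimulus = geometry_correlation(within_square(rdm, 0, n_images),
                                  within_square(rdm, 1, n_images))
r_recognition = geometry_correlation(within_square(rdm, 1, n_images),
                                     within_square(rdm, 2, n_images))
print(r_stimulus, r_recognition)
```

Because the simulated Post patterns are nearly identical to the Pre patterns, r[Pre-Pre,Post-Post] is close to 1 here, whereas r[Post-Post,Gray-Gray] reflects only chance similarity.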

Neural activity reflecting stimulus processing (indexed by r[Pre-Pre,Post-Post]) exhibited an early sharp peak reaching significance at 30–310 ms following stimulus onset and a small second peak that reaches significance at 1340–1420 ms (p<0.05, cluster-based permutation test; Figure 4E, green). By contrast, neural activity reflecting recognition processing (indexed by r[Post-Post,Gray-Gray]) occurs in a broad temporal period: it onsets shortly after stimulus onset and reaches significance at 690–1240 ms and again from 1550 ms to after stimulus offset (Figure 4E, magenta). As a control measure, the correlation between Pre-Pre and Gray-Gray squares of the RDM was not significant at any time point, suggesting that, as expected, representational geometry is different between conditions with different stimulus input and different recognition outcomes. In addition, an analysis using RDMs constructed with cross-validated Euclidean distances (for details, see Materials and methods, Euclidean distance) yielded similar results (figure not shown), suggesting that these findings are not driven by changing levels of noise between conditions. Nonetheless, likely due to insufficient statistical power, a direct contrast between r[Pre-Pre,Gray-Gray] and r[Post-Post,Gray-Gray] did not yield any significant time point following cluster-based correction. Thus, these results provide qualitative evidence – at the level of representational geometry – in accordance with our earlier conclusion that prior-guided visual disambiguation involves a dynamic two-part process including early stimulus-feature-related processing and late recognition-content-related processing. In the final analysis presented below, we will quantitatively test this possibility using a model-driven MEG-fMRI fusion analysis that simultaneously elucidates the spatial dimension of the evolving neural dynamics.

Model-driven MEG-fMRI data fusion spatiotemporally resolves neural dynamics related to stimulus, attention, and recognition processing

In order to spatiotemporally resolve neural dynamics underlying different information processing stages in prior-guided visual perception, we applied a recently developed model-driven MEG-fMRI data fusion approach (Hebart et al., 2018). Nineteen additional subjects performed a similar Mooney image task, with an identical set of Mooney and grayscale images, during whole-brain 7T fMRI scanning (for details see Materials and methods). Our earlier MVPA and RSA results obtained from this fMRI data set suggested the involvement of frontoparietal (FPN) and default-mode (DMN) network regions in Mooney image disambiguation, in addition to early and category-selective visual areas (González-García et al., 2018). Based on these findings, 20 regions-of-interest (ROIs) were defined, covering early visual cortex (V1-V4), lateral occipital complex (LOC), fusiform gyrus (FG), and regions in the FPN and DMN (Figure 6B).

Figure 6 (with 1 supplement)
Model-based MEG-fMRI fusion analysis.

(A) Model RDMs and analysis cartoon. Left: RDMs corresponding to ‘Stimulus’, ‘Recognition’, and ‘Attention’ models. For details, see Results. Right: MEG RDM from each time point and fMRI RDM from each ROI are compared, and shared variance between them that is accounted for by each model is computed. (B) ROIs used in the fMRI analysis. For details, see Materials and methods. These were defined based on a previous study (González-García et al., 2018). (C) Schematics for the commonality analyses employed in the model-based MEG-fMRI data fusion (results shown in E-F). Because neural activities related to the Stimulus and Recognition model overlap in time (see D), to dissociate them, variance uniquely attributed to each model was calculated (left). Shared variance between MEG and fMRI RDMs accounted for by the attention model is also assessed (right). (D) Correlation between model RDMs and MEG RDMs at each time point. Horizontal bars denote significant correlation (p<0.05, cluster-based permutation tests). (E) Commonality analysis results for Stimulus (i), Recognition (ii), and Attention (iii) models. Colors denote significant (p<0.05, cluster-based permutation tests) presence of neural activity corresponding to the model in a given ROI and at a given time point (with 10 ms steps). (F) Commonality time courses for the Stimulus (red) and Recognition (yellow) model (analysis schematic shown in panel C, left) for five selected ROIs, showing shared MEG-fMRI variance explained by each model. Total shared variance between MEG and fMRI RDMs for each ROI is plotted as gray shaded area. PCC: posterior cingulate cortex; R Frontal: right frontal cortex in the FPN. Horizontal bars denote significant model-related commonality (p<0.05, cluster-based permutation tests).

https://doi.org/10.7554/eLife.41861.009

We designed three model RDMs to capture different information processing stages (Figure 6A):

First, a ‘stimulus’ model, which captures dissimilarity structure based on physical image features. This model includes three levels of dissimilarity: low (blue diagonal in the Pre-Post square, capturing dissimilarity between the same Mooney image presented pre- and post-disambiguation); medium (white off-diagonal elements in the Pre-Pre, Pre-Post, Post-Post, and Gray-Gray squares, capturing dissimilarity between different Mooney images and between different grayscale images); and high (red off-diagonal elements in the Pre-Gray and Post-Gray squares, capturing dissimilarity between Mooney and non-matching grayscale images). Thus, this model considers both gross image statistics (black-and-white Mooney vs. grayscale) and features specific to each image (the same Mooney image presented across conditions). For simplicity, the diagonals in the Pre-Gray and Post-Gray squares are excluded from this model, since we do not have an a priori judgment about these values in comparison to the others.

Second, a ‘recognition’ model, which aims to capture content-specific recognition processing. This model includes three levels of dissimilarity: low (blue off-diagonal elements in the Pre-Pre square and blue diagonal in the Post-Gray square; we note that the equivalence of these two categories was arbitrary); medium (white Pre-Post and Pre-Gray squares); and high (red off-diagonal elements in the Post-Post, Post-Gray and Gray-Gray squares). This model capitalizes on the intuition that the content of recognition is image-specific, and postulates that neural representations of two recognized images that have different contents are most distinct from each other, those of two unrecognized images are most similar to each other, and the dissimilarity is intermediate between a recognized and an unrecognized image. Moreover, since a post-disambiguation Mooney image yields a similar recognition content as its matching grayscale image, the model assumes a low dissimilarity between them.

Third, an ‘attention’ model, which captures dissimilarity structure based on the recognition status. This model includes two levels of dissimilarity: low (blue elements in the Pre-Pre, Post-Post, Gray-Gray, and Post-Gray squares); and high (red elements in the Pre-Post and Pre-Gray squares). This model postulates that non-recognized images (Pre) are represented similarly to each other, and recognized images (Post and Gray) are represented similarly to each other, while recognized and non-recognized images are represented differently. Thus, the ‘attention’ model captures changes induced by the status of recognition regardless of content, such as heightened attention or arousal that accompanies recognition.
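
The three model RDMs described above can be written down explicitly. The sketch below is one possible encoding, under several assumptions that are not specified in the text: conditions are ordered Pre, Post, Gray; the dissimilarity levels are coded as 0 (low), 1 (medium), and 2 (high); and excluded cells, including the main diagonal, are marked as NaN. All names are hypothetical.

```python
import numpy as np

LOW, MEDIUM, HIGH = 0.0, 1.0, 2.0   # assumed ordinal coding of dissimilarity levels
PRE, POST, GRAY = 0, 1, 2           # assumed order of conditions in the RDM


def _set_square(rdm, a, b, n, offdiag=None, diag=None):
    """Fill square (a, b) and its mirror (b, a); cells left unset stay NaN (excluded)."""
    block = np.full((n, n), np.nan)
    eye = np.eye(n, dtype=bool)
    if offdiag is not None:
        block[~eye] = offdiag
    if diag is not None:
        block[eye] = diag
    rdm[a * n:(a + 1) * n, b * n:(b + 1) * n] = block
    rdm[b * n:(b + 1) * n, a * n:(a + 1) * n] = block.T


def stimulus_model(n):
    rdm = np.full((3 * n, 3 * n), np.nan)
    for cond in (PRE, POST, GRAY):
        _set_square(rdm, cond, cond, n, offdiag=MEDIUM)
    _set_square(rdm, PRE, POST, n, offdiag=MEDIUM, diag=LOW)  # same Mooney image
    _set_square(rdm, PRE, GRAY, n, offdiag=HIGH)              # diagonal excluded
    _set_square(rdm, POST, GRAY, n, offdiag=HIGH)             # diagonal excluded
    return rdm


def recognition_model(n):
    rdm = np.full((3 * n, 3 * n), np.nan)
    _set_square(rdm, PRE, PRE, n, offdiag=LOW)                # two unrecognized images
    _set_square(rdm, POST, POST, n, offdiag=HIGH)
    _set_square(rdm, GRAY, GRAY, n, offdiag=HIGH)
    _set_square(rdm, PRE, POST, n, offdiag=MEDIUM, diag=MEDIUM)
    _set_square(rdm, PRE, GRAY, n, offdiag=MEDIUM, diag=MEDIUM)
    _set_square(rdm, POST, GRAY, n, offdiag=HIGH, diag=LOW)   # matching content
    return rdm


def attention_model(n):
    rdm = np.full((3 * n, 3 * n), np.nan)
    for cond in (PRE, POST, GRAY):
        _set_square(rdm, cond, cond, n, offdiag=LOW)
    _set_square(rdm, POST, GRAY, n, offdiag=LOW, diag=LOW)    # both recognized
    _set_square(rdm, PRE, POST, n, offdiag=HIGH, diag=HIGH)
    _set_square(rdm, PRE, GRAY, n, offdiag=HIGH, diag=HIGH)
    return rdm


n_images = 3
stimulus = stimulus_model(n_images)
recognition = recognition_model(n_images)
attention = attention_model(n_images)
print(stimulus)
```

Only the rank order of the levels matters if the models are compared with the data using a rank-based measure; the particular numeric codes are a convenience for the example.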

We first assessed the correspondence between each model and the MEG RDM (computed using the 1 – r measure, see Figure 4B) at each time point (i.e. model-based RSA; for example see Harel et al., 2014; Wardle et al., 2016; Vida et al., 2017). The results reveal distinct temporal waveforms for neural activity related to each model (Figure 6D): the stimulus model dominates in the early period before 500 ms and exhibits a second broad plateau between 600 ms and 2.5 s, reaching statistical significance at 50–410 and 620–2410 ms (p<0.05, cluster-based permutation test). By contrast, the recognition model dominates in the late period, from ~500 ms to 2.5 s, reaching significance from 670 ms until the end of trial epoch. Interestingly, the attention model shows a peak at the transition between stimulus and recognition models, around 500 ms after stimulus onset, and reaches significance at 400–570 ms. We note that the waveforms of neural activity related to stimulus and recognition models qualitatively agree with the ‘stimulus-based’ and ‘recognition-based’ neural activity identified in the earlier intra-RDM analysis (Figure 4E), even though the methods employed by these two analyses are distinct.

To elucidate the brain regions contributing to each process, we conducted a model-driven MEG-fMRI data fusion analysis, using a commonality analysis approach (Seibold and McPhee, 1979; Hebart et al., 2018). Because neural activity related to the stimulus and recognition models overlapped in time (Figure 6D), we first performed an analysis to decompose the amount of shared variance between fMRI RDM from a given ROI and MEG RDM at a given time point that is uniquely explained by each model, while excluding the other model’s effect (see schematic in Figure 6C, left, and eq. 5 in Materials and methods). The results of this analysis are shown in Figure 6Ei-ii. The stimulus-related effects (Figure 6E–i) had the earliest onset in the right V1 and V2 at 80 ms, followed by progressive recruitment of areas along the visual hierarchy. Stimulus-related effects in most higher order frontoparietal regions reached significance much later (after 400 ms), with the exception of bilateral frontal cortices (right: 230 ms; left: 240 ms) and the PCC (240 ms), where significant effects occurred nearly simultaneously with the last visual areas. Interestingly, after 600 ms, stimulus-related effects exhibit sustained significance across category-selective visual areas (LOC and FG) and frontoparietal regions (FPN and DMN), and, at the same time, recurrent transient significance in early visual areas (V1-V4). This pattern may be driven by a continued cross-talk between higher-order and lower-order regions related to stimulus-content processing while the image is present.

By contrast, the recognition-related effects (Figure 6E–ii) exhibited broad spatiotemporal significance from ~650 ms until the end of trial epoch, covering all ROIs investigated but with earlier onset and more sustained significance in higher-order regions (LOC, FG, FPN and DMN) than early visual areas. Stimulus and recognition-related commonality time courses (i.e. shared MEG-fMRI variance explained by each model) are plotted in Figure 6F for five selected ROIs across the cortical hierarchy (see Figure 6—figure supplement 1 for results from remaining ROIs).

Because neural activity related to the attention model did not overlap in time with the other two models (Figure 6D), we performed a second analysis to quantify the amount of shared variance between fMRI and MEG RDMs that is explained by the attention model (Figure 6C, right, and eq. 4 in Materials and methods). Attention-related effects were only found in bilateral frontal and parietal cortices, which reached significance first in frontal cortices at 340 ms (Figure 6E–iii).
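
The logic of the two commonality analyses can be sketched with ordinary least-squares R² values. The formulas below are the standard commonality coefficients for two and three predictors; they are assumed, not confirmed, to correspond to eq. 4 and eq. 5 in Materials and methods. The inputs stand for vectorized RDMs, and the function names and simulated data are hypothetical. A rank-based variant would rank-transform the vectors first.

```python
import numpy as np


def r_squared(y, *predictors):
    """R^2 of an ordinary least-squares fit of y on the predictors plus an intercept."""
    X = np.column_stack([np.ones_like(y)] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ beta
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def commonality(meg, fmri, model):
    """Variance in the MEG RDM shared with both the fMRI RDM and the model RDM."""
    return (r_squared(meg, fmri) + r_squared(meg, model)
            - r_squared(meg, fmri, model))


def commonality_excluding(meg, fmri, model, other):
    """As above, but excluding variance that is also explained by a second model."""
    return (r_squared(meg, fmri, other) + r_squared(meg, model, other)
            - r_squared(meg, other) - r_squared(meg, fmri, model, other))


# Toy vectorized RDMs that share one underlying structure (simulated data).
rng = np.random.default_rng(1)
shared = rng.normal(size=500)
meg = shared + 0.5 * rng.normal(size=500)
fmri = shared + 0.5 * rng.normal(size=500)
model = shared + 0.5 * rng.normal(size=500)
unrelated = rng.normal(size=500)

c_total = commonality(meg, fmri, model)
c_unique = commonality_excluding(meg, fmri, model, unrelated)
c_removed = commonality_excluding(meg, fmri, model, model)
print(c_total, c_unique, c_removed)
```

When the excluded model is unrelated to the data, the commonality is essentially unchanged; when the excluded model is identical to the model of interest, nothing unique remains and the commonality falls to zero.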

Together, these findings reveal the spatiotemporal evolution of neural activities underlying different information processing stages of prior-guided visual ambiguity resolution, including stimulus-feature, attentional-state, and recognition-content-related processing. Importantly, they show that stimulus-feature processing progresses from lower to higher order brain regions in an early time window, while recognition-related processing proceeds from higher to lower order regions in a later time period. Moreover, attention-related processing mediated by frontoparietal regions occurs at an intermediate time latency (~500 ms) and may facilitate the transition from stimulus processing to successful recognition.

Discussion

Despite the pervasive need to resolve stimulus ambiguity (caused by occlusion, clutter, shading, and inherent complexities of natural objects) in natural vision (Olshausen and Field, 2005) and the enormous power that prior knowledge acquired through past experiences wields in shaping perception (Helmholtz, 1924; Albright, 2012), the neural mechanisms underlying prior-guided visual recognition remain mysterious. Here, we exploited a dramatic visual phenomenon, where a single exposure to a clear, unambiguous image greatly facilitates recognition of a related degraded image, to shed light on dynamical neural mechanisms that allow past experiences to guide recognition of impoverished sensory input. Below we summarize our findings and discuss their implications.

Using whole-head MEG data, we first characterized the temporal evolution of neural activity patterns differentiating between perceptual states: pre- vs. post-disambiguation Mooney images, where identical stimulus input begets distinct perceptual outcomes depending on whether a perceptual prior is available. We observed that large-scale neural activity patterns (in the <35 Hz range) recorded by MEG reliably distinguished perceptual states starting from 300 ms after stimulus onset, and that these activity patterns are sustained from 800 ms onward. Consistent with an earlier study (Baria et al., 2017), slow cortical potentials (<5 Hz) accounted for much of this effect. These multivariate findings complement previous univariate observations showing perceptual state-related changes in higher frequency power and connectivity patterns (Rodriguez et al., 1999; Grützner et al., 2010; Minami et al., 2014; Moratti et al., 2014). However, to understand how perceptual priors interact with stimulus input to resolve recognition, identifying perceptual state-related changes in neural activity is far from sufficient. Such changes may reflect disambiguation-induced changes in attention, salience or task difficulty that are unrelated to perceptual processing of individual image content. To elucidate how disambiguation influences dynamic neural representation of individual images, we calculated time-resolved RDMs, which quantify the dissimilarity (1 − r; Figure 4), distance (cross-validated Euclidean distance; Figure 4—figure supplement 2), and single-trial separability (rwithin − rbetween; Figure 5) between neural activity patterns elicited by every image pair (within and across perceptual conditions), and performed fine-grained analyses based on these RDMs.

We first observed that across-image dissimilarity for post-disambiguation Mooney images rises above that for their pre-disambiguation counterparts from ~500 ms onward (Figure 4D), and, surprisingly, closely parallels that for the grayscale images despite enormous differences in image features and the strength of bottom-up input between them.

To disentangle neural activities underlying different information processing stages involved in prior-guided visual perception, we compared the neural representational geometry (Kriegeskorte and Kievit, 2013) between perceptual conditions (Figure 4E). This analysis capitalizes on the intuition that neural activity encoding physical stimulus features should have identical representation for a Mooney image presented pre- and post-disambiguation. We identified the temporal evolution of neural dynamics fulfilling this criterion, which exhibits a sharp peak within ~300 ms following stimulus onset, and, interestingly, a second small peak that onsets around 800 ms and reaches significance at 1.3 s. Speculatively, this second peak may reflect top-down feedback related to filling in of perceptual details that follows the initial object recognition (Ahissar and Hochstein, 2004; Campana and Tallon-Baudry, 2013). Likewise, neural activity encoding the content of recognized objects (e.g. a crab, a motorcycle) should have similar representation between a post-disambiguation Mooney image and its matching grayscale image. The time course of such neural activity slowly increases following stimulus onset and exceeds stimulus-feature-related processing starting from ~500 ms.

We further investigated separability – at the single-trial level – between neural activity patterns elicited by the same image (or matching Mooney and grayscale images) presented in different perceptual conditions. The results revealed that within 500 ms after stimulus onset, neural activity patterns elicited by the same Mooney image pre- and post-disambiguation are substantially more similar to each other (i.e. lower separability) than to other images (Figure 5D, green); additionally, the neural activity pattern elicited by a Mooney image is significantly separable from that elicited by its matching grayscale image (Figure 5C, orange and dark blue). By contrast, after ~500 ms post-stimulus-onset, at the single-trial level, the neural activity pattern elicited by a post-disambiguation Mooney image is indistinguishable from that elicited by its matching grayscale image (Figure 5C, dark blue), yet they are significantly separable from other grayscale images (Figure 5D, dark blue). In the same late time period, the neural activity pattern elicited by a pre-disambiguation Mooney image is well separable from that elicited by the same image presented after disambiguation or by its matching grayscale image (Figure 5C, orange and green). These results provide clear evidence for a strong and specific shift of neural representation of each image toward the particular relevant prior experience, which gradually builds up after stimulus onset and is full-blown from ~500 ms onward.

Thus, multiple analyses probing neural representation format using dissimilarity, distance, and single-trial separability are in accordance with each other: they reveal content-specific neural processing of stimulus input in an early (<300 ms) time period, which transitions into content-specific neural processing related to recognition outcome in a late (>500 ms) time period. We note that the latency related to recognition observed in the current study is substantially later than previously reported onset times of object category and face identity-related information (Carlson et al., 2013; van de Nieuwenhuijzen et al., 2013; Kaiser et al., 2016; Vida et al., 2017). Yet, the late recognition-related neural dynamics observed herein are consistent with human subjects’ reaction times reporting recognition status for disambiguated Mooney images at around 1.2 s (Hegdé and Kersten, 2010). Several aspects likely contribute to this difference: first, bottom-up stimulus information is ambiguous and much weaker for Mooney images than images showing clear and isolated objects typically used in object categorization tasks; second, our analyses probe neural activity related to recognizing individual image content rather than object category, and processing of face identity likely benefits from a specialized circuitry; third, neural activity differentiating between object categories may also reflect early processing of low-level image features that differ between categories (Coggan et al., 2016). Importantly, the present finding of slow neural dynamics underlying recognition-related processing is consistent with our hypothesis that recruitment of perceptual templates encoded in higher order frontoparietal areas and long-distance recurrent activity are needed for prior-experience-guided visual ambiguity resolution. This hypothesis receives further support from the model-driven MEG-fMRI fusion analysis, which we discuss below.

We capitalized on the ability of RSA to project high-dimensional neural data from different modalities into a common representational space (Kriegeskorte et al., 2008a) to combine the high temporal resolution of MEG data with high spatial resolution of a separate 7T fMRI data set. In addition, we adopted a recently developed model-driven data fusion approach building on RSA (Hebart et al., 2018) to spatiotemporally resolve neural activity underlying different information processing stages. To this end, we constructed three model RDMs that capture the representation format of idealized processes related to stimulus feature, recognition content, and attentional state processing (Figure 6A). Although these models capture relatively coarse features in the data, the time courses of MEG activity related to each model (Figure 6D) are consistent with the earlier fine-grained analysis applied to MEG RDMs, including the dominance of stimulus and recognition processing in the early (<500 ms) and late (>500 ms) time period, respectively, and a late second peak in stimulus-related processing. Interestingly, neural activity related to the attention model, which captures dissimilarity driven by recognition status (recognized vs. non-recognized), shows a single peak around 500 ms, suggesting that a transient salience signal may facilitate recognition processing of individual image content.

The model-driven MEG-fMRI data fusion analysis dissects the shared variance between MEG RDM at a given time point and fMRI RDM from a given brain region that is uniquely accounted for by each model. Consistent with our earlier fMRI findings (González-García et al., 2018), recognition-related neural activity is widely distributed across the cortical hierarchy – from early visual areas to frontoparietal and default-mode networks (Figure 6E–ii). Importantly, this analysis reveals that recognition-related activity reaches significance at an earlier time in higher-order regions (LOC, FG, FPN and DMN, 670–680 ms) than in lower-order regions (V1-V3, 760 ms or later). This sequence is consistent with our hypothesis that higher-order brain regions are crucial for encoding perceptual priors and initiating prior-guided recognition.

We note that bilateral frontal cortices of the FPN are the only regions showing both attention-related activity and early stimulus-related activity that reaches significance at 230–240 ms – immediately following fusiform gyri at 220–230 ms. This result resonates with our previous fMRI observation that following disambiguation, frontal areas of the FPN move up the cortical hierarchy as defined by the neural representation format (González-García et al., 2018). Together, these findings support the idea that frontal areas may play a special role in utilizing internal priors to guide perceptual processing (Bar et al., 2006; Summerfield et al., 2006; Wang et al., 2013).

In conclusion, the present results reveal, for the first time, how neural activities underlying different information processing stages during prior-guided visual perception dynamically unfold across space and time. These findings significantly further our understanding of the neural mechanisms that endow previous experiences with enormous power to shape our daily perception. In line with theories positing aberrant interactions between internal priors and sensory input in psychiatric illnesses (Friston et al., 2014), behavioral and neural abnormalities associated with Mooney image disambiguation have been reported in patients with schizophrenia and autism (Sun et al., 2012; Rivolta et al., 2014; Teufel et al., 2015). Thus, our findings may also inform studies on the pathophysiological processes involved in these debilitating illnesses.

Materials and methods

Subjects

The experiment was approved by the Institutional Review Board of the National Institute of Neurological Disorders and Stroke (under protocol #14 N-0002). All subjects were right-handed and neurologically healthy with normal or corrected-to-normal vision. Eighteen subjects between 21 and 33 years of age (mean age 26.2; nine females) participated in the MEG experiment. Nineteen additional subjects (age range = 19–32; mean age = 24.6; 11 females) participated in a 7T fMRI experiment, using a similar task paradigm as in the MEG experiment (González-García et al., 2018). The two subject groups did not have any overlap since subjects needed to be naïve to the Mooney images used in this experiment. All subjects provided written informed consent.

Stimuli

Mooney and grayscale images were generated from grayscale photographs of real-world objects and animals selected from the Caltech (http://www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html) and Pascal VOC (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html) databases. Grayscale images were constructed by cropping photographs of objects and animals in a natural setting to 500 × 500 pixels and applying a box filter. Mooney images were constructed by thresholding the grayscale images. Threshold level and filter size were initially set at the median intensity of each image and 10 × 10 pixels, respectively. Each parameter was titrated so that the Mooney image was difficult to recognize without first seeing the corresponding grayscale image. From an original set of 252 images, thirty-nine (19 were inanimate objects, and 20 were animals – unbeknownst to the subjects) were chosen to be used in this experiment via an initial screening procedure conducted by six additional participants recruited separately from the main experiment (for details, see Chang et al., 2016). Stimuli were presented using E-Prime Software (Psychology Software Tools, Sharpsburg, PA) via a Panasonic PT-D3500U projector with an ET-DLE400 lens. All images subtended 11.9 × 11.9 degrees of visual angle.

Task paradigm

Each trial began with a 1 s fixation period during which subjects fixated on a red cross in the center of the screen (Figure 1A). Thereafter, a Mooney image or a grayscale image was presented for 2 s. The red cross was present during image presentation, and subjects were instructed to keep their gaze fixated whenever it was onscreen. After image presentation, there was another fixation period, the duration of which was uniformly distributed between 1 and 2 s. This was followed by a response prompt of ‘Can you recognize the object hidden in the image?’ to assess subjective recognition of each image presentation. Below this prompt, the answer choices, ‘Yes’ and ‘No’ were presented on each side of the screen with their positions randomly varied across trials. Subjects were instructed to answer the question by pressing one of two buttons using their right thumb, with each button corresponding to one side of the screen. The response prompt terminated when a response was given, and each trial ended with a blank screen of jittered duration uniformly distributed between 1.5 and 2.5 s.

Trials were organized into blocks, using a structure similar to previous studies (Gorlin et al., 2012; Chang et al., 2016). Each block consisted of 15 trials: three different grayscale images followed by six Mooney images, then a repeat of the same six Mooney images (Figure 1B). Three of the Mooney images corresponded to the preceding grayscale images (‘post-disambiguation’) and the other three were novel (‘pre-disambiguation’). The presentation order of these six Mooney images in each repeat was randomized. The block was then repeated twice (three presentations in total) with shuffled grayscale image sequences and shuffled Mooney image sequences, followed by a verbal test session. This constituted one experimental run. Grayscale images corresponding to pre-disambiguation Mooney images were presented in the subsequent run. In total, each participant completed 14 runs. For all Mooney images to be presented both pre- and post-disambiguation, the first and last runs were half runs. The first run included only three pre-disambiguation Mooney images. The final run included three post-disambiguation Mooney images and their grayscale counterparts. All other full runs consisted of three grayscale images, three post-disambiguation Mooney images, and three novel pre-disambiguation Mooney images. In all, each unique grayscale image was presented three times for each subject, and each unique Mooney image was presented six times before and six times after disambiguation. The full experiment lasted approximately 2 hr.
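
The structure of a full run can be checked with a short sketch. It assumes three presentations of the 15-trial block per run, which follows from the stated counts of three presentations per grayscale image and six per Mooney image in each perceptual state. The image identifiers, function names, and use of Python's `random` module are hypothetical; the experiment itself was run in E-Prime.

```python
import random


def make_block(disambiguated_ids, novel_ids, rng):
    """One 15-trial block: 3 grayscale images, then 6 Mooney images shown twice."""
    gray = [("Gray", i) for i in disambiguated_ids]
    rng.shuffle(gray)
    mooney = ([("Post", i) for i in disambiguated_ids]
              + [("Pre", i) for i in novel_ids])
    first_pass = mooney[:]
    rng.shuffle(first_pass)       # order randomized within each repeat
    second_pass = mooney[:]
    rng.shuffle(second_pass)
    return gray + first_pass + second_pass


def make_run(disambiguated_ids, novel_ids, n_blocks=3, seed=0):
    """A full run: the block presented n_blocks times with reshuffled sequences."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_blocks):
        trials.extend(make_block(disambiguated_ids, novel_ids, rng))
    return trials


run = make_run(disambiguated_ids=[0, 1, 2], novel_ids=[3, 4, 5])
print(len(run))                   # trials in one full run
print(run.count(("Gray", 0)))     # presentations of one grayscale image
print(run.count(("Pre", 3)))      # presentations of one pre-disambiguation image
```

Under these assumptions a full run contains 45 trials, each grayscale image appears three times, and each Mooney image appears six times in its current perceptual state, matching the totals given above.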

The verbal test was included to verify that subjects’ recognition of Mooney images was the correct interpretation. It consisted of presenting the six different Mooney images from the preceding run for 2 s each on the screen and participants were asked to verbally respond what they saw in the image. They were allowed to answer ‘I don’t know’. Verbal responses were scored as correct or incorrect using a pre-determined list of acceptable answers for each image. Subjects were verbally tested on each Mooney image once before disambiguation and once after disambiguation. No MEG signal was recorded during the verbal test.

Of the 39 unique Mooney images used for the main experiment, 33 had their corresponding grayscale images presented (‘real image sets’), while the remaining six were presented with non-matching grayscale images (‘catch image sets’) as controls. The same set of real and catch images were used for all subjects. Details on statistical analyses can be found in the following sections and in the Results.

MEG data acquisition

While performing the task, subjects’ brain activity was recorded with a 275-channel whole-head MEG system (CTF), and their gaze position and pupil size were recorded using a SR Research Eyelink 1000+ system. Eye-tracking was used for online monitoring of fixation during the experiment. Three dysfunctional MEG sensors were removed from all analyses. MEG data were collected with a 600 Hz sampling rate and an anti-aliasing filter of 150 Hz. Before and after each run, the head position of a subject was measured using fiducial coils. Subjects responded to subjective recognition questions using a fibreoptic response button box. All MEG data samples were corrected with respect to the presentation delay of the projector (measured with a photodiode).

MEG data preprocessing

The FieldTrip package implemented in MATLAB (Oostenveld et al., 2011) was used for preprocessing in conjunction with custom-written code in MATLAB (Mathworks, Natick, MA). MEG data were demeaned, detrended, and filtered in two different frequency bands (using 3rd-order Butterworth filters) for further analyses: slow cortical potentials (SCPs, 0.05 – 5 Hz; down-sampled to 10 Hz), and the classic event-related field (ERF) frequency range (DC – 35 Hz; down-sampled to 100 Hz). Independent component analysis (ICA) was applied to continuous data from each run, and components corresponding to eye blinks, eye movements, and heartbeat-related artifacts were removed. Data were then epoched into 3.5 s trials consisting of a 0.5 s pre-stimulus period and a 3 s post-stimulus period (including the 2 s image presentation and the first second of the jitter-fixate period; see Figure 1A). Baseline correction was applied for each sensor using the pre-stimulus time window.
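
The epoching and baseline-correction step can be illustrated with a short NumPy sketch. It is not the authors' FieldTrip/MATLAB code: the function name `epoch_and_baseline`, the array layout, the `onsets` input, and the choice of subtracting the pre-stimulus mean are all assumptions made for illustration.

```python
import numpy as np

def epoch_and_baseline(data, onsets, sfreq, t_pre=0.5, t_post=3.0):
    """Cut continuous data (n_sensors, n_samples) into trials and subtract
    each sensor's mean over the pre-stimulus window (assumed baseline).

    onsets: stimulus-onset sample indices (hypothetical input).
    Returns an array of shape (n_trials, n_sensors, n_times).
    """
    n_pre = int(round(t_pre * sfreq))
    n_post = int(round(t_post * sfreq))
    epochs = np.stack([data[:, o - n_pre:o + n_post] for o in onsets])
    baseline = epochs[:, :, :n_pre].mean(axis=2, keepdims=True)
    return epochs - baseline

# Toy usage: 4 sensors sampled at 100 Hz, two stimulus onsets.
rng = np.random.default_rng(0)
continuous = rng.standard_normal((4, 2000)) + 5.0  # constant offset to remove
epochs = epoch_and_baseline(continuous, onsets=[200, 900], sfreq=100)
```

With a 100 Hz sampling rate, each 3.5 s epoch contains 350 samples, and the pre-stimulus mean of every sensor is zero after correction.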

Multivariate pattern analysis (MVPA)

MVPA was carried out using both the 0.05 – 5 Hz data and the DC – 35 Hz data. First, MEG activity at each time sample was normalized across sensors (Pereira et al., 2009). For each subject, classification of perceptual state (pre- vs. post-disambiguation) was performed using activity from all MEG sensors averaged across six presentations for each image in each perceptual state, using all 33 non-catch image sets. We implemented a linear support vector machine (SVM) classifier (cost = 1) using the LIBSVM package (Chang and Lin, 2011) at each time point in the trial epoch. An odd-even cross-validation scheme was used; classification accuracy was averaged across the two folds and reported as balanced accuracy (Brodersen et al., 2013). Activation patterns corresponding to the MEG activity contributing to the classifier were computed for each subject and time point by multiplying the vector of SVM decoder weights with the covariance matrix of the data set used to train a given classifier (Haufe et al., 2014). For display purposes, activation patterns were averaged across subjects. To test the cross-time generalization of classifiers, a classifier trained at a given time point was tested at all time points in the trial epoch, yielding a temporal generalization matrix (TGM) (Stokes et al., 2013; King and Dehaene, 2014). If a classifier generalizes across time points, the decoded information is represented in a similar format at those time points. If it does not generalize, the information is either represented differently or absent. A control analysis was carried out in a similar manner but using only the six catch image sets, for which pre- and post-disambiguation states were defined as before and after the presentation of the artificially assigned, non-matching grayscale images.

An additional analysis (disambiguation decoding) used only non-catch image sets in which the Mooney image was unrecognized in the pre-disambiguation stage and recognized in the post-disambiguation stage. These image sets were selected for each individual subject based on behavioral responses.
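
The conversion of decoder weights into activation patterns described above (weights multiplied by the covariance matrix of the training data) can be sketched as follows. This is a minimal illustration only: `activation_pattern` is a hypothetical helper, and the random vector `w` merely stands in for the weights of a trained linear SVM.

```python
import numpy as np

def activation_pattern(X_train, w):
    """X_train: (n_samples, n_sensors) training data at one time point.
    w: (n_sensors,) linear decoder weights.
    Returns covariance(X_train) @ w, i.e. one value per sensor."""
    cov = np.cov(X_train, rowvar=False)  # sensors x sensors
    return cov @ w

rng = np.random.default_rng(0)
X = rng.standard_normal((66, 5))  # e.g. 33 images x 2 states, 5 toy sensors
w = rng.standard_normal(5)        # stand-in for trained SVM weights
pattern = activation_pattern(X, w)
```

Unlike raw decoder weights, the resulting pattern can be read as the sensor-level signal associated with the decoded variable, which is why it is the quantity displayed.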

Cluster-based permutation tests for multivariate pattern decoding

The group-level statistical significance of classifier accuracy at each time point was assessed by a one-tailed, one-sample Wilcoxon signed rank test against chance level (50%). To correct for multiple comparisons, we used cluster-based permutation tests (Maris and Oostenveld, 2007). Temporal clusters were defined as contiguous time points yielding significantly above-chance classification accuracy (p<0.05). The test statistic W of the Wilcoxon signed rank test was summed across time points in a cluster to yield a cluster’s summary statistic. Cluster summary statistics were compared to a null distribution, constructed by shuffling class labels 500 times and extracting the largest cluster summary statistic for each permutation. Clusters in the original data with summary statistics exceeding the 95th percentile of the null distribution were considered significant (corresponding to p<0.05, cluster-corrected). For classifier temporal generalization, the permutation-based approach for cluster-level statistical inference used the same procedure as above, where clusters were defined as contiguous time points in both training and generalization dimensions with significant (p<0.05) above-chance classification accuracy.
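
The cluster-forming and thresholding logic for the one-dimensional (time) case can be sketched as below. The sketch assumes that the per-time-point test statistics, the significance mask, and the null distribution of per-permutation maximum cluster sums have already been computed; the function names and the toy numbers are illustrative, not taken from the study.

```python
import numpy as np

def cluster_sums(stat, significant):
    """Sum `stat` within each run of contiguous True values in `significant`.
    Returns a list of (start, stop, summed_stat), with stop exclusive."""
    clusters, start = [], None
    for t, flag in enumerate(significant):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            clusters.append((start, t, float(np.sum(stat[start:t]))))
            start = None
    if start is not None:
        clusters.append((start, len(significant), float(np.sum(stat[start:]))))
    return clusters

def significant_clusters(stat, significant, null_max_sums, alpha=0.05):
    """Keep clusters whose summed statistic exceeds the (1 - alpha)
    percentile of the null distribution of maximum cluster sums."""
    threshold = np.percentile(null_max_sums, 100 * (1 - alpha))
    return [c for c in cluster_sums(stat, significant) if c[2] > threshold]

stat = np.array([1.0, 5.0, 6.0, 1.0, 4.0, 1.0])
sig = np.array([False, True, True, False, True, False])
null = np.arange(100.0) / 10.0  # made-up null maxima: 0.0, 0.1, ..., 9.9
found = significant_clusters(stat, sig, null)
```

In this toy case two clusters are formed, and only the first (summed statistic 11.0) exceeds the 95th percentile of the made-up null distribution.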

Representational similarity analysis (RSA)

For this and following analyses, to achieve higher temporal resolution, we used data filtered in the DC – 35 Hz range (down-sampled to 100 Hz). Similar to the preprocessing for MVPA analysis, MEG data were normalized across sensors at each time point. For each subject, data were averaged across the three presentations for each grayscale image. In order to compare between Mooney images and grayscale images, for each Mooney image the first three presentations were averaged together in the pre- and post-disambiguation stage, respectively. At each time point in the trial epoch, we computed representational distance, calculated as 1 – Pearson’s r (computed across all sensors), between every image pair in all presentation stages: pre-disambiguation Mooney, post-disambiguation Mooney, and grayscale (Figure 4A). This generated a 99 × 99 representational dissimilarity matrix (RDM) at each time point (with 10 ms steps). Catch image sets were excluded from this analysis.
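
For a single time point, the 1 – Pearson’s r computation can be sketched with NumPy. The row ordering (33 pre-disambiguation, 33 post-disambiguation, 33 grayscale patterns) and the sensor count of 272 (275 channels minus the three removed sensors) are assumptions used only to shape the toy data.

```python
import numpy as np

def correlation_rdm(patterns):
    """patterns: (n_items, n_sensors) sensor patterns at one time point.
    Returns an n_items x n_items matrix of 1 - Pearson's r."""
    return 1.0 - np.corrcoef(patterns)

rng = np.random.default_rng(0)
patterns = rng.standard_normal((99, 272))  # toy data: 33 images x 3 stages
rdm = correlation_rdm(patterns)
```

Repeating this at every sample of the 100 Hz data would give the time-resolved RDM with 10 ms steps.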

Using the time-resolved RDM, we averaged across all image-pairs within each presentation condition (i.e. off-diagonal elements within the Pre-Pre, Post-Post, and Gray-Gray squares of the RDM, see Figure 4C–i) to yield a mean dissimilarity time course for each condition. Dissimilarity time courses were compared between conditions using a Wilcoxon signed-rank test across subjects and corrected for multiple comparisons using a cluster-based permutation test similar to that described above, with 5000 shuffles of class labels for each subject.
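
Averaging the off-diagonal elements within one within-condition square can be sketched as follows; `within_condition_mean` and the small 4 × 4 toy RDM (two conditions with two images each) are hypothetical.

```python
import numpy as np

def within_condition_mean(rdm, start, stop):
    """Mean of the off-diagonal elements of the square rdm[start:stop, start:stop]."""
    block = rdm[start:stop, start:stop]
    off_diag = ~np.eye(stop - start, dtype=bool)
    return float(block[off_diag].mean())

toy_rdm = np.array([
    [0.0, 0.2, 0.9, 0.9],
    [0.2, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.6],
    [0.9, 0.9, 0.6, 0.0],
])
mean_first = within_condition_mean(toy_rdm, 0, 2)   # first condition
mean_second = within_condition_mean(toy_rdm, 2, 4)  # second condition
```

Applied at every time point to the Pre-Pre, Post-Post, and Gray-Gray squares, this would yield one dissimilarity time course per condition.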

Euclidean distance

To ensure that the results were unaffected by changing levels of noise between perceptual conditions, the above analysis was repeated using RDMs constructed with Euclidean distance and cross-validated Euclidean distance. Letting x and y represent neural activity vectors elicited by two different images (or the same image presented in two different conditions), the (non-cross-validated) Euclidean distance is calculated as

(1) $d_{\mathrm{Euclidean}}^{2}(x, y) = (x - y)(x - y)^{T}$,

and cross-validated Euclidean distance is calculated as

(2) $d_{\mathrm{Euclidean,\,c.v.}}^{2}(x, y) = (x - y)_{[A]} \, (x - y)_{[B]}^{T}$,

where A and B denote the two partitions of the data within each cross-validation fold (Guggenmos et al., 2018). We used a three-fold cross-validation scheme to calculate cross-validated Euclidean distance. With cross-validation, the contribution by noise in the data cancels out, and the result is only driven by the component in the data that is consistent across partitions.
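
A minimal NumPy sketch of the two distance measures is given below. How trials were assigned to folds is not specified here, so the interleaved fold assignment, the averaging of trials within each partition, and the averaging of the product across folds are assumptions.

```python
import numpy as np

def euclidean_sq(x, y):
    """Squared Euclidean distance between two pattern vectors (eq. 1)."""
    d = x - y
    return float(d @ d)

def cv_euclidean_sq(x_trials, y_trials, n_folds=3):
    """Cross-validated squared Euclidean distance (eq. 2).
    x_trials, y_trials: (n_trials, n_sensors). In each fold, the held-out
    trials form partition B and the remaining trials form partition A."""
    folds_x = np.arange(len(x_trials)) % n_folds  # assumed fold assignment
    folds_y = np.arange(len(y_trials)) % n_folds
    products = []
    for k in range(n_folds):
        d_a = x_trials[folds_x != k].mean(axis=0) - y_trials[folds_y != k].mean(axis=0)
        d_b = x_trials[folds_x == k].mean(axis=0) - y_trials[folds_y == k].mean(axis=0)
        products.append(d_a @ d_b)
    return float(np.mean(products))

# Noise-free toy data: both measures agree.
x_trials = np.tile([1.0, 0.0], (6, 1))
y_trials = np.tile([0.0, 1.0], (6, 1))
d_plain = euclidean_sq(x_trials[0], y_trials[0])
d_cv = cv_euclidean_sq(x_trials, y_trials)
```

Because partitions A and B contain different trials, noise that is independent across trials does not contribute systematically to the product, whereas it always inflates the non-cross-validated distance.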

Intra-RDM analysis

To probe fine-grained information available in the RDMs, we performed an intra-RDM analysis (for detailed rationale, see Results). This analysis contained two components: First, to assess neural activity related to processing the physical features of individual images, we calculated Pearson’s correlation between the Fisher-z transformed values of Pre-Pre and Post-Post squares of the MEG representational similarity matrices (RSMs, equal to 1-RDM) at each time point, using the upper triangle of each (Figure 4C–ii, ‘Stimulus-based’). Second, to assess neural activity related to the recognition outcomes of individual images, we calculated Pearson’s correlation between the Fisher-z transformed values of Post-Post and Gray-Gray squares of the MEG RSM, again using the upper triangle of each (Figure 4C–ii, ‘Recognition-based’). The correlation values were Fisher-z transformed, and group-level statistics were assessed by one-sample t-tests against zero followed by cluster-based permutation tests with 5000 permutations.
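
One component of the intra-RDM analysis at a single time point can be sketched as follows. The helper name, the block index ranges, and the simulated data (post-disambiguation patterns generated as noisy copies of pre-disambiguation patterns) are assumptions for illustration.

```python
import numpy as np

def intra_rdm_correlation(rdm, block_a, block_b):
    """Correlate the Fisher-z transformed upper triangles of two
    within-condition squares of the similarity matrix (1 - RDM).
    block_a, block_b: (start, stop) index ranges. Returns Fisher-z of r."""
    def upper_z(start, stop):
        rsm = 1.0 - rdm[start:stop, start:stop]  # back to Pearson's r
        iu = np.triu_indices(stop - start, k=1)  # upper triangle, no diagonal
        return np.arctanh(rsm[iu])
    r = np.corrcoef(upper_z(*block_a), upper_z(*block_b))[0, 1]
    return float(np.arctanh(r))

rng = np.random.default_rng(0)
pre = rng.standard_normal((33, 50))
post = pre + 0.5 * rng.standard_normal((33, 50))  # shares stimulus structure
rdm = 1.0 - np.corrcoef(np.vstack([pre, post]))
stimulus_z = intra_rdm_correlation(rdm, (0, 33), (33, 66))
```

Here the Pre-Pre and Post-Post squares share their pairwise similarity structure by construction, so the resulting value is positive.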

Single-trial separability

To assess how well neural activities elicited by Image A and Image B (importantly, these can represent two different images presented in the same or different conditions, or the same/matching image presented in different conditions) can be separated at the single-trial level, we computed a separability measure as follows.

Let $x_i$ and $y_i$ be the activity vectors across all sensors on the i-th presentation of Image A and Image B, respectively. And suppose that Image A and Image B are presented for a total of m and n trials, respectively. (Each Mooney image is presented six times before and six times after disambiguation, and each grayscale image is presented three times total.) We calculated the following measures:

$r_{\mathrm{within}} = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} r_z(x_i, x_j) + \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} r_z(y_i, y_j)$
$r_{\mathrm{between}} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} r_z(x_i, y_j)$

where $r_z$ denotes Fisher-z-transform of Pearson’s correlation r-value. Separability is calculated as

(3) $\mathrm{Separability}_{A,B} = r_{\mathrm{within}} - r_{\mathrm{between}}$

Thus, separability quantifies how much neural similarity across multiple presentations of the same image exceeds neural similarity between the two different images and is akin to single-trial decoding.
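
The separability measure for one image pair at one time point can be sketched as below. The sketch follows the definitions as printed above, including a within term that sums the two per-image averages; the helper names and the simulated trials are assumptions.

```python
import numpy as np

def _rz(a, b):
    """Fisher-z transformed Pearson correlation between two vectors."""
    return np.arctanh(np.corrcoef(a, b)[0, 1])

def separability(x_trials, y_trials):
    """x_trials: (m, n_sensors) trials of Image A; y_trials: (n, n_sensors)
    trials of Image B. Returns r_within - r_between (eq. 3)."""
    m, n = len(x_trials), len(y_trials)
    within_x = np.mean([_rz(x_trials[i], x_trials[j])
                        for i in range(m - 1) for j in range(i + 1, m)])
    within_y = np.mean([_rz(y_trials[i], y_trials[j])
                        for i in range(n - 1) for j in range(i + 1, n)])
    between = np.mean([_rz(x, y) for x in x_trials for y in y_trials])
    return float(within_x + within_y - between)

# Toy data: two distinct sensor templates plus independent trial noise.
rng = np.random.default_rng(0)
template_a = rng.standard_normal(100)
template_b = rng.standard_normal(100)
trials_a = template_a + rng.standard_normal((6, 100))  # six presentations
trials_b = template_b + rng.standard_normal((3, 100))  # three presentations
sep = separability(trials_a, trials_b)
```

When the two images evoke distinct but reproducible patterns, as in this toy case, within-image correlations exceed between-image correlations and the measure is positive.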

We constructed the time-resolved 99 × 99 RDMs using the separability measure, which were subjected to two further quantitative analyses. In the first analysis, we extracted the mean of diagonal elements in the between-condition squares, which yielded three time-dependent outputs (Figure 5C–i). In the second analysis, we computed the difference between the mean off-diagonal value and the mean diagonal value within each between-condition square, which again yielded three time-dependent outputs (Figure 5D–i). For each analysis, we evaluated the statistical significance of each output against chance level using a one-sample t-test against 0, and the statistical significance of pairwise comparisons between the outputs using Wilcoxon signed-rank tests; all statistical tests were corrected for multiple comparisons using cluster-based permutation tests with 5000 permutations.

7T fMRI data collection and regions of interest (ROI) definition

We carried out an fMRI study using a similar paradigm, which included the same 33 Mooney images and their grayscale counterparts that made up the non-catch image sets in the MEG experiment. Run and block structure were identical to the MEG experiment. In the fMRI experiment, each trial included a 2 s fixation period, a 4 s image presentation, a 2 s blank period, and a 2 s response period to assess subjective recognition (similar question as in the MEG experiment). Detailed methods and results related to the fMRI study are reported separately (González-García et al., 2018); here we describe data collection and ROI definition procedures relevant to the current study.

fMRI data were collected on a Siemens 7T scanner equipped with a 32-channel head coil (Nova Medical). T1-weighted anatomical images were obtained using an MP-RAGE sequence (sagittal orientation, 1 × 1 × 1 mm resolution). Additionally, a proton-density (PD) sequence was used to obtain PD-weighted images, also with 1 × 1 × 1 mm resolution, to help correct for field inhomogeneity in the MP-RAGE images (Van de Moortele et al., 2009). Functional images were obtained using a single-shot echo planar imaging (EPI) sequence (TR = 2000 ms, TE = 25 ms, flip angle = 50°, 52 oblique slices, slice thickness = 2 mm, spacing = 0 mm, in-plane resolution = 1.8 × 1.8 mm, FOV = 192 mm, acceleration factor/GRAPPA = 3). The functional data were later resampled to 2 mm isotropic voxels. Respiration and cardiac data were collected using a breathing belt and a pulse oximeter, respectively, simultaneously with fMRI data acquisition, and physiological noise was removed during preprocessing of fMRI data using the RETROICOR method (Glover et al., 2000). Anatomical and functional data preprocessing followed standard procedures and is described in detail in González-García et al. (2018).

ROIs were defined as follows. A separate retinotopic localizer and a lateral occipital complex (LOC) functional localizer were performed for each subject to define bilateral early visual ROIs (V1, V2, V3, and V4) and LOC, respectively. Fusiform gyrus (FG) ROIs were extracted using the Harvard-Oxford Cortical Structural Atlas (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Atlases). Default mode network (DMN) regions, including bilateral lateral parietal cortices (LatPar), medial prefrontal cortex (mPFC), and posterior cingulate cortex (PCC), were defined using a general linear model (GLM) of the disambiguation contrast (pre-disambiguation-not-recognized vs. post-disambiguation-recognized). Lastly, a statistical map from searchlight MVPA decoding of the disambiguation contrast was used to define the frontoparietal network (FPN) ROIs: bilateral frontal and parietal cortices. The locations of all ROIs are shown in Figure 6B; for further details see González-García et al. (2018). Critically, none of the analyses used to define the ROIs depended on the RDMs used for the MEG-fMRI data fusion analysis described below; in addition, results obtained using FPN and DMN ROIs defined based on an independent resting-state study (Power et al., 2011) were similar (not included due to length considerations).

For each of the 20 ROIs, a 99 × 99 RDM was constructed using the activation patterns corresponding to each image in each condition, using the same image order as for the MEG RDM. These activation patterns were derived from a GLM and averaged across image presentations (three presentations for grayscale images; only the first three presentations were used for Mooney images in each stage in order to equalize statistical power across conditions). Representational distance between every image pair was computed as 1 – Pearson’s r (computed across all voxels within an ROI), as for the MEG RDM.

Model-based representational similarity analysis (RSA) of MEG data

RSA enables neural representation format to be compared with theoretical models of information representation (Kriegeskorte et al., 2008a). Here, we used a priori defined models to probe how neural representation of different types of information dynamically evolved over time. Three models were defined as RDMs (same dimensions as the MEG RDMs) where high dissimilarity was expressed as 1 (Figure 6A, red), low dissimilarity as 0 (Figure 6A, blue), and intermediate dissimilarity as 0.5 (Figure 6A, white). Black diagonals in these model RDMs (Figure 6A) denote elements excluded from the model’s analysis (defined as NaNs).

For each model and time point, Spearman correlation was computed between the upper triangles of model RDM and group-averaged MEG RDM to assess how well the model explained the MEG RDM (Figure 6D). Statistical significance was established using a cluster-based permutation test. The null distribution was calculated using 1000 permutations of the MEG RDM, where for each permutation the image order was shuffled, but with the same shuffled order along the x- and y- axis of the RDM and across all time points. Clusters were defined as contiguous time points where the Spearman rho value was greater than the 95th percentile of the null distribution. To identify significant clusters, we determined the 95th percentile of maximum cluster size across all permutations, and clusters in the original data that exceeded this cut-off were deemed significant (equivalent to p<0.05, one-sided).
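
The model-to-data correlation and the image-order permutation can be sketched for a single time point as follows. The tie-aware ranking helper, the function names, and the 6 × 6 toy model (two groups of three images) are assumptions; the cluster-forming step across time points is omitted.

```python
import numpy as np

def rank_with_ties(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman_upper(model_rdm, data_rdm):
    """Spearman correlation between upper triangles, skipping model NaNs."""
    iu = np.triu_indices(model_rdm.shape[0], k=1)
    m, d = model_rdm[iu], data_rdm[iu]
    keep = ~np.isnan(m)
    return float(np.corrcoef(rank_with_ties(m[keep]), rank_with_ties(d[keep]))[0, 1])

def permutation_null(model_rdm, data_rdm, n_perm=1000, seed=0):
    """Null distribution from shuffling image order identically on both axes."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for p in range(n_perm):
        order = rng.permutation(data_rdm.shape[0])
        null[p] = spearman_upper(model_rdm, data_rdm[np.ix_(order, order)])
    return null

# Toy model: low dissimilarity within each group, high between groups.
model = np.ones((6, 6))
model[:3, :3] = 0.0
model[3:, 3:] = 0.0
np.fill_diagonal(model, np.nan)  # excluded elements
noise = np.random.default_rng(0).random((6, 6)) * 0.1
data = np.nan_to_num(model) + (noise + noise.T) / 2
np.fill_diagonal(data, 0.0)
rho = spearman_upper(model, data)
null = permutation_null(model, data, n_perm=50)
```

Shuffling rows and columns with the same order keeps the permuted matrix a valid RDM while breaking the correspondence between images and model entries.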

Model-based MEG-fMRI data fusion

Following previous studies (Kriegeskorte et al., 2008b; Cichy et al., 2014), we employed cross-modal RSA to combine fMRI and MEG data from independent participant groups (N = 19 and 18, respectively). Furthermore, we applied a recently developed approach based on commonality analysis (Seibold and McPhee, 1979; Hebart et al., 2018) to use the theoretical models described in the previous section to guide the cross-modal MEG-fMRI data fusion.

Specifically, commonality analysis was employed in two ways. First, it was used to determine the shared variance between a model (M1 in eq. 4), the fMRI RDM from a given ROI and the MEG RDM at a given time point (Figure 6C, right), calculated as:

(4) $C_{\mathrm{MEG.fMRI,M1}} = R^{2}_{\mathrm{MEG.fMRI}} + R^{2}_{\mathrm{MEG.M1}} - R^{2}_{\mathrm{MEG.fMRI,M1}}$

Second, it was used to determine the shared variance between a model, the fMRI RDM and the MEG RDM which is unique to that model, and is not shared by a second model (Figure 6C, left). This allows the dissociation of the respective contributions of each model to the shared variance between fMRI and MEG RDMs. The commonality (i.e. shared variance between model, fMRI and MEG RDM) for a model of interest (M1 in eq. 5) that is not explained by a second model (M2) is calculated as follows:

(5) $C_{\mathrm{MEG.fMRI,M1}} = R^{2}_{\mathrm{MEG.fMRI,M2}} + R^{2}_{\mathrm{MEG.M1,M2}} - R^{2}_{\mathrm{MEG.M2}} - R^{2}_{\mathrm{MEG.fMRI,M1,M2}}$

(RDM elements that contain NaN values in any model are excluded from the analysis.)
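
Equations 4 and 5 can be evaluated with a few regressions, as in the sketch below. It assumes that each RDM has already been vectorized (upper triangle, NaN elements removed) and that each R² is the variance in the MEG RDM explained by an ordinary least-squares fit on the listed predictors. Whether the inputs were rank-transformed first is not stated here, so using raw values is an assumption, as are the function names and the simulated vectors.

```python
import numpy as np

def r_squared(y, *predictors):
    """Variance in y explained by an OLS fit on the predictors (with intercept)."""
    X = np.column_stack([np.ones_like(y)] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ beta
    return 1.0 - residual.var() / y.var()

def commonality_eq4(meg, fmri, m1):
    """Variance shared by the MEG RDM, the fMRI RDM and model M1 (eq. 4)."""
    return r_squared(meg, fmri) + r_squared(meg, m1) - r_squared(meg, fmri, m1)

def commonality_eq5(meg, fmri, m1, m2):
    """Shared variance attributable to M1 but not to model M2 (eq. 5)."""
    return (r_squared(meg, fmri, m2) + r_squared(meg, m1, m2)
            - r_squared(meg, m2) - r_squared(meg, fmri, m1, m2))

# Toy vectors: MEG and fMRI both carry model M1 plus independent noise.
rng = np.random.default_rng(0)
m1 = rng.standard_normal(500)
m2 = rng.standard_normal(500)  # unrelated second model
meg = m1 + 0.5 * rng.standard_normal(500)
fmri = m1 + 0.5 * rng.standard_normal(500)
c4 = commonality_eq4(meg, fmri, m1)
c5 = commonality_eq5(meg, fmri, m1, m2)
c5_redundant = commonality_eq5(meg, fmri, m1, m1)  # M2 identical to M1
```

When the second model is identical to the first, nothing remains that is unique to M1, so eq. 5 returns zero up to floating-point error.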

Significance for these commonalities was determined using cluster-based permutation tests, following the same method as explained above for model-based MEG RSA. Further, significant clusters in Figure 6E were masked by only including spatiotemporal locations where both MEG-Model RDM correlation and fMRI-Model RDM correlation are significant. Lastly, we assessed the correspondence between group-averaged MEG and fMRI RDMs using Spearman correlation, resulting in a cross-modal RDM similarity time course for each ROI (Figure 6F, gray shading). Specifically, the squared Spearman rho was calculated and compared with the model-based commonality measures derived from eq. 4 or eq. 5. The squared Spearman rho provides the upper bound for the maximal amount of shared variance between MEG and fMRI RDMs to be explained by the theoretical models (Hebart et al., 2018).

References

  10. C-C Chang, C-J Lin (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:1–27.
  12. DD Coggan, DH Baker, TJ Andrews (2016). The role of visual and semantic properties in the emergence of Category-Specific patterns of neural response in the human brain. eNeuro, 3, 10.1523/ENEURO.0158-16.2016, 27517086.
  24. HV Helmholtz (1924). Treatise on Physiological Optics. New York: Optical Society of America.

Decision letter

  1. Christian Büchel
    Reviewing Editor; University Medical Center Hamburg-Eppendorf, Germany
  2. Michael J Frank
    Senior Editor; Brown University, United States
  3. Christian Büchel
    Reviewer; University Medical Center Hamburg-Eppendorf, Germany

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for sending your article entitled "Neural dynamics of visual ambiguity resolution by perceptual prior" for peer review at eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Michael Frank as the Senior Editor.

The reviewers have indicated that the topic of the paper, namely identifying the temporal dynamics underlying the Mooney recognition effect and controlling for non-content-specific effects such as increased attention, salience or decreased task difficulty, is very valuable and novel. However, the reviewers remain cautious about whether the data presented conclusively support these claims.

1) The paper claims that the results are image-specific, but Figure 4D shows condition-specific results. In particular, Figure 4D only shows that recognition decreases between stimulus pattern dissimilarity. But this effect can be driven by many factors, e.g. decreased noise.

2) Figure 4E shows a significantly positive correlation between the representational similarity structure for recognized Mooney images and unambiguous images, but not that this effect is greater for post vs. pre Mooney images. Therefore, this effect cannot be conclusively related to Mooney image disambiguation.

Reviewer #1:

This paper is on perceptual processing with respect to prior knowledge. They use Mooney images (binary images without recognizable content), which, after one has seen the underlying grey scale image, are easily identified. This is a very powerful perceptual effect and allows the investigation of how prior information affects recognition. The paper uses MEG (and fMRI) in combination with a series of clever time resolved decoding approaches to show "time-courses of dissociation".

In addition, they used representational similarity analysis (RSA) to show the time-course of similarities between pre and post (same physical stimulus) and post and gray (same percept, that is, recognition). Not too surprisingly, they show that these time resolved similarities differ, with the stimulus based similarities peaking early and the recognition based similarities peaking later.

Finally, they employ a powerful model based RSA approach where they investigate the commonalities of RSA based on MEG (as before), fMRI and a theoretically predicted RSA (i.e. a model). The model can incorporate recognition (high similarity between post and gray) etc. Importantly, by looking for commonalities across MEG and fMRI, they can, based on pattern similarity, fuse fMRI and MEG. Although clever and informative, a similar approach has already been published (visual object recognition) (Cichy et al., 2014).

Although the presented data are very interesting and show what can be done with clever multivariate methods, including model-based RSA analyses, the promise of the title "Neural dynamics of visual ambiguity resolution by perceptual prior" is not fulfilled by this paper. Potentially, this data could give us some insights on how the integration of prior and incoming visual information works. This is only vaguely addressed, e.g. by data shown in Figure 3C.

In addition, one could argue that the novelty of this paper is only incremental: In a previous paper by Cichy et al., 2014, and a subsequent paper by Hebart describing a similar approach using model based RSA (Hebart et al., 2018) similar results were obtained. They studied object recognition, which is also based on prior information (volunteers know the objects and have seen them before in a different manner), although there is no control condition (i.e. identical visual stimulus, but different percept) as in a Mooney faces experiment.

The current paper should either provide more information about the neural dynamics of visual ambiguity resolution or at least explain how their approach adds novel insights over and above the papers mentioned above.

Reviewer #2:

This study uses fMRI and MEG and an accomplished psychophysical paradigm to investigate how experience-driven image recognition affects neural responses – across time and space. To do so, it uses several multivariate analyses as well as multivariate MEG-fMRI data fusion.

The key findings are (i) that experience-driven recognition can be decoded from 300 ms onwards based on MEG response patterns, (ii) that this information (see i) persists in a stationary manner over time, (iii) that recognition increases MEG pattern distances between Mooney images, (iv) that the representational geometry (MEG-based) of recognized Mooney images correlates significantly with that for the corresponding original gray-scale images and (v) that shared variance between MEG and fMRI RDMs uniquely accounted for by a recognition model is widespread in the brain (in all ROIs) from 500 ms onwards.

The methodology is without doubt advanced and the general question that this study is supposed to address is of wide general interest. However, my first main concern is that the authors do not introduce a specific hypothesis nor outline exactly how this research is going to bring us closer to understanding how experience guides recognition. As a result, the study, although being informative, comes across as a "fishing expedition".

Another major concern is that the authors do not report statistically solid univariate findings, which makes it impossible to relate the findings reported here to previous imaging studies employing a similar paradigm. The authors do present SVM weight-maps. However, these maps are anecdotal at best as they are not statistically evaluated in any way. Reporting univariate fMRI data would also be extremely valuable, as it would for example enable readers to assess how the MEG-fMRI modeling results relate to fMRI response amplitude (and SNR).

Furthermore, the authors make a claim that is not fully supported by their findings: they state that "This analysis showed that image-specific information for post-disambiguation Mooney images rises higher than their pre-disambiguation counterparts starting from ~500 ms" based on finding iii. This is misleading, because greater between-image pattern distances do not directly imply greater stimulus information. This finding could, for example, also be explained by noisier responses for recognized Mooney images.

Another issue is that a crucial test is missing related to finding iv (Figure 4E). This finding implies that recognizing Mooney images causes representational geometry (MEG based) to become more similar to that for the corresponding set of gray-scale images. However, the authors need to demonstrate that this increase in RDM-RDM similarity is significantly greater for the Post RDM as compared to the Pre RDM.

Finally, I don't see why finding ii is of interest (Figure 3D). To me it is unclear what sets this case of (MEG) pattern information persistence apart from previous reports of this phenomenon (e.g. Carlson et al., 2013), and how it functionally relates to experience-driven recognition.

Given these issues, I do not recommend publication of this manuscript in its current state.

Reviewer #3:

Summary:

Flounders and colleagues investigated how the visual and cognitive processing during recognition of images unfolds over time. To pinpoint the effects of the prior, including knowledge of the image content and expectation, they presented participants with two-tone "Mooney" images that are initially difficult to recognize but after disambiguation allow recognition of the stimulus (take the famous picture of the Dalmatian dog as an example). Using MEG decoding, the authors show striking differences between Mooney images before recognition ("pre") vs. after recognition ("post") emerging after ~300 ms. Using representational similarity analysis (RSA) to compare between different experimental phases, they show that, as expected, the similarity of stimulus-related patterns of activity increases rapidly after stimulus onset (comparing pre and post-recognition Mooney images), while recognition-related patterns emerge much later ~800 ms (comparing real images with post-recognition). By relating their results to a separate 7T fMRI dataset with model-based RSA, for several regions of interest (ROIs) the authors reveal time courses of information specific to different model components, related to stimulus-based, recognition-based and attention-based processes. They find that stimulus-based processes exhibit an early and a late peak, recognition-based processes dominate throughout time and region after ~500 ms, and attention-based processes are very specifically located to frontal and parietal ROIs at specific time points. They interpret these results in light of the effects of prior experience on object recognition.

Assessment:

The authors address a timely and interesting question of how prior experience affects visual and cognitive processing in the human brain. The manuscript uses state-of-the-art methodology in MEG decoding, RSA and MEG-fMRI fusion, and all statistical analyses appear to be sound. I specifically liked the combination of MEG and fMRI data for spatiotemporally-resolved analysis, and the related results are fascinating. In addition, I very much liked the addition of a control condition to make sure the results are not merely due to stimulus repetition (post-recognition images have been seen more frequently than pre-recognition stimuli) or simple stimulus-association effects.

At the same time, I believe the authors make some claims not supported by the data. They highlight that part of the novelty of their work has to do with the fact that previous work on this topic did not reveal image-specific results and, indeed, the authors do report image-specific findings in Figure 4E. However, in contrast to the authors' claim, the other effects using RSA are likely not stimulus-specific. For example, the results in Figure 4D are averaged across stimuli, leading to condition-specific effects. To achieve stimulus-specific effects, the authors would have to either identify the similarity for the same stimulus to itself or identify the difference between same stimulus and different stimulus within different periods of the experiment. They could do this by carrying out a split-half analysis and calculating the difference (within – between). This would be equivalent to a stimulus-specific decoding analysis. I think this kind of analysis would be useful to support their results. Alternatively, the authors may want to adjust this description of their results with respect to stimulus-specificity in the Materials and methods, Results, and Discussion.

A similar argument could be made regarding the model-based MEG-fMRI fusion results. The stimulus-specific model focuses on gross differences between Mooney images and greyscale images, rather than individual images. The recognition-specific model assumes that images post-recognition all become different from each other, which would lead to high dissimilarity. However, in line with the authors' interpretation of their prior work (Gonzalez-Garcia et al., 2018), one could also argue that they should become more similar to each other (when treated as the class of objects rather than individual images). In addition, their model interpretation would assume that the image itself should at least become more similar to itself, i.e. according to their interpretation, in my understanding the model would have to contain off-diagonal elements for the same image between grayscale and post-recognition periods.

To strengthen their conclusions, I would suggest the addition of stimulus-specific or at least category-specific (e.g. animate – inanimate) decoding analyses. Further, I would suggest carrying out a category-specific analysis (e.g. animate – inanimate) to confirm the claims that the results are indeed recognition-related.

While, as mentioned above, the addition of a control analysis is great, it only makes up a fraction of the other conditions. Therefore, the absence of decoding or RSA effects may be due to reduced power. What would the equivalent analysis look like for the experimental data if it is similarly reduced in size?

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Neural dynamics of visual ambiguity resolution by perceptual prior" for further consideration at eLife. Your revised article has been favorably reviewed by three reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Michael Frank as the Senior Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:

As you will see below, there are remaining issues with Figure 4E, as outlined in the comment by reviewer #2. The new Figure 5 needs further clarification for the reader, and there is one outstanding issue related to how exemplar-specific (i.e., image-specific) these findings are. Finally, the model needs either adjustment or a justification in the Discussion section (see comment by reviewer #3).

Reviewer #2:

I am glad to see that the authors have found an elegant way of addressing my most important issue – the lack of direct support for the claim of recognition driven content-specific effects. I think that this issue is now covered by the results shown in Figure 5.

I still would like to insist on a revision of Figure 4E. The pink line in the figure is referred to as a "recognition-based" time course. This is misleading, as the difference between the pink and the green line can be explained by differences in representational geometry due to recognition and by differences related to presenting Gray vs. Mooney images. Hence, there is no (significant) evidence for this effect being more pronounced for the Post-Gray than the Pre-Gray comparison. Therefore, the authors should explicitly clarify that the difference shown in this figure cannot be conclusively attributed to recognition. They could also simply omit 4E, as the content-specific recognition effect is now directly demonstrated in Figure 5. The authors could also consider conceptually linking Figures 4 and 5 by clarifying that Figure 4D leaves open the question of whether the recognition-driven enhanced (and equal-to-Gray) representational dissimilarity for Mooney images is driven by activation patterns to Mooney images becoming more similar to their Grayscale counterparts. At present, a compelling motivation for the Figure 5 analysis is missing at the start of the corresponding Results section.

With respect to my introduction-related comment, I still think that the clarity of the paper would be enhanced by including a more specific hypothesis. As noted in the rebuttal, this presently boils down to this sentence: "these observations raise the intriguing possibility that slow, long-distance recurrent neural dynamics involving large-scale brain networks are necessary for prior-experience-guided visual recognition." I find it hard to see how this rather broad and vague hypothesis is a natural motivation for the specific research performed during this study, or how it is precisely addressed by the findings.

Reviewer #3:

I would like to thank the authors for taking up many of my suggestions. I believe the clarifications in the text and the addition of the novel stimulus-specific analyses greatly strengthened the paper and the conclusions that can be drawn from the results. Nevertheless, I have some remaining reservations.

It is correct that the analyses in Figure 4D were carried out at the image level. Nevertheless, in my understanding it is impossible to tell whether these effects are image-specific. This is a small correction to my original assessment where I said these results were condition-specific, while in fact they simply do not allow distinguishing between condition-specific and image-specific effects. The goal of this analysis is to show differences in the image-specific effects between the three conditions. Using the dissimilarity matrices in 4B and the analyses of the authors in 4C, then indeed the expected dissimilarity matrix for image-specific effects would exhibit low dissimilarity everywhere within a condition. This dissimilarity would then be expected to be different for each condition and change across time. This is the result the authors showed in 4D. However, for condition-specific effects the expected dissimilarity matrix would show low dissimilarity within condition, as well. The only way to tell these apart is by comparing the dissimilarity within image to the dissimilarity between. Since the authors already conducted this analysis, it is just a matter of clarification that the results in 4D cannot distinguish between condition-specific and image-specific results.

Regarding the MEG-fMRI fusion modeling, I agree with the authors that the model is a choice the author has to make, but it has to be both justified and consistent. For the former, perhaps it would make sense to spell out the construction of the models in some more detail. For the latter, if the authors do not want to make adjustments, I would suggest discussing those limitations. The "recognition" model assumes that (1) Mooney images in the Pre-phase are not recognized and are all similarly "unrecognized", (2) all recognized stimuli are different from each other, and (3) all recognized stimuli are as different from each other as they are from unrecognized images. I can follow (1). However, as I mentioned, for (2) according to the authors' interpretation the "diagonals of between-condition squares" should be similar to each other. Without this, the model is inconsistent. The authors argued they cannot predict the exact value expected for those cells. However, since they are using the Spearman correlation, they would just need to choose if the dissimilarity is lower than within Pre (i.e. the blue square), the same, or higher. If the authors cannot decide, they could leave out those cells and remain agnostic. Note that, however, the same issue of comparability arises regarding (3): it seems like an even stronger assumption that in terms of recognition all recognized images are as different from each other as they are from unrecognized images. The authors should either be explicit about this, adjust the model, or remove those cells from the analysis.
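The reviewer's suggestion to "leave out those cells and remain agnostic" can be illustrated with a minimal sketch. It assumes a binary model RDM ordered as [Pre, Post, Gray] blocks, built only from the three assumptions listed above; the function names, the block ordering, and the choice to mask only the same-image Post-vs-Gray cells are illustrative assumptions, not the authors' implementation. Spearman correlation is computed here as the Pearson correlation of tie-averaged ranks.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks in which tied values share their average rank."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)          # last rank occupied by each unique value
    starts = ends - counts + 1        # first rank occupied by each unique value
    return ((starts + ends) / 2.0)[inverse]

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def recognition_model_rdm(n_images, mask_matching=False):
    """Hypothetical binary 'recognition' model over [Pre, Post, Gray] blocks."""
    n = 3 * n_images
    model = np.ones((n, n))               # assumptions (2) and (3): all else dissimilar
    model[:n_images, :n_images] = 0.0     # assumption (1): Pre images are alike
    np.fill_diagonal(model, 0.0)
    if mask_matching:
        idx = np.arange(n_images)
        # Same image, Post vs. Gray: set to NaN so the model makes no prediction here
        model[n_images + idx, 2 * n_images + idx] = np.nan
        model[2 * n_images + idx, n_images + idx] = np.nan
    return model

def model_fit(model, data_rdm):
    """Spearman correlation over upper-triangle cells that the model specifies."""
    upper = np.triu_indices_from(model, k=1)
    m, d = model[upper], data_rdm[upper]
    keep = ~np.isnan(m)
    return spearman(m[keep], d[keep])
```

Because Spearman correlation depends only on rank order, a version that does specify those cells would only need to state whether they are lower than, equal to, or higher than the within-Pre value, which is the point the reviewer makes.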

https://doi.org/10.7554/eLife.41861.014

Author response

The reviewers have indicated that the topic of the paper, namely identifying the temporal dynamics underlying the Mooney recognition effect and controlling for non-content-specific effects such as increased attention, salience or decreased task difficulty, is very valuable and novel. However, the reviewers remain cautious about whether the data presented conclusively support these claims.

1) The paper claims that the results are image-specific, but Figure 4D shows condition-specific results. In particular, Figure 4D only shows that recognition decreases between stimulus pattern dissimilarity. But this effect can be driven by many factors, e.g. decreased noise.

First, because we showed that between-image dissimilarity is higher for post- than pre- images, we believe the above comment meant to say “Figure 4D only shows that recognition increases between stimulus pattern dissimilarity. But this effect can be driven by many factors, e.g. increased noise.”

Second, we would like to point out that for the analysis shown in Figure 4D, we had already conducted a control analysis using Euclidean distance and cross-validated Euclidean distance. The results of this analysis are presented in Figure 4—figure supplement 2, which may have been overlooked. Importantly, cross-validated Euclidean distance is a metric that is unaffected by noise in the data due to cross-validation (Guggenmos et al., 2018). Specifically, it is calculated as:

$$d^{2}_{\text{Euclidean, c.v.}}(x, y) = (x - y)_{[A]}\,(x - y)_{[B]}^{T} \quad (1)$$

where A and B denote the two partitions of the data within each cross-validation fold. (For this analysis, we used a 3-fold cross-validation scheme.) This way, noise in the data cancels out, and the cross-validated Euclidean distance is only driven by signal, that is, the component of the data that is consistent across partitions. As explained by Guggenmos et al., this approach “can improve the reliability of distance estimates when noise levels differ between measurements” (P. 438).
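As a rough illustration of why cross-validation removes the noise contribution, the distance above can be sketched as follows. This is a minimal sketch, not the pipeline used in the study: the function name `cv_sq_euclidean`, the trial-by-feature array layout, and the choice of taking the held-out fold as partition B and the remaining folds as partition A are all assumptions made for the example.

```python
import numpy as np

def cv_sq_euclidean(x_trials, y_trials, n_folds=3):
    """Cross-validated squared Euclidean distance between two conditions.

    x_trials, y_trials: arrays of shape (n_trials, n_features), one row per trial.
    """
    x_folds = np.array_split(x_trials, n_folds)
    y_folds = np.array_split(y_trials, n_folds)
    estimates = []
    for k in range(n_folds):
        # Partition B: mean pattern difference in the held-out fold
        diff_b = x_folds[k].mean(axis=0) - y_folds[k].mean(axis=0)
        # Partition A: mean pattern difference in the remaining folds
        x_a = np.concatenate([f for i, f in enumerate(x_folds) if i != k])
        y_a = np.concatenate([f for i, f in enumerate(y_folds) if i != k])
        diff_a = x_a.mean(axis=0) - y_a.mean(axis=0)
        # Noise is independent across partitions, so it averages out of the product
        estimates.append(diff_a @ diff_b)
    return float(np.mean(estimates))

# Demo on pure noise: the ordinary squared distance is inflated by noise,
# while the cross-validated estimate fluctuates around zero.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(60, 1000))
y_demo = rng.normal(size=(60, 1000))
print(np.sum((x_demo.mean(axis=0) - y_demo.mean(axis=0)) ** 2))
print(cv_sq_euclidean(x_demo, y_demo))
```

Unlike the ordinary squared distance, this estimate can be negative, which is expected when the true distance is close to zero.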

As expected, this control analysis using cross-validated Euclidean distance yielded results very similar to those shown in Figure 4D. The method for this analysis was previously tucked away in the legend to Figure 4—figure supplement 2. We have now included the detailed method in the manuscript (Materials and methods, subsection “Euclidean distance”) and have better explained the rationale and interpretation of this control analysis in Results (subsection “Disambiguation increases across-image dissimilarity in neural dynamics”).

Third, to further boost the claim for content-specific effects, we have now conducted a stimulus-specific decoding analysis using the “correlation (within-between)” metric as suggested by reviewer #4. This analysis allowed us to test, at the single-trial level, how well we can distinguish neural activities elicited by a Mooney image presented before vs. after disambiguation, and how well we can distinguish a pre- or post-disambiguation Mooney image from its corresponding Grayscale image. We were very pleased by the clarity and the robustness of this finding, and are deeply grateful to reviewer #4 for suggesting this analysis. The results, presented in a newly added figure (Figure 5), show that within a ~300 ms window after stimulus onset, a Mooney image – whether presented before or after disambiguation – is well separable at the single-trial level from its corresponding grayscale image. By contrast, after 500 ms post-stimulus-onset, a post-disambiguation Mooney image is entirely indistinguishable from its matching grayscale image at the single trial level despite differences in stimulus features (but well separable from other grayscale images, Figure 5D), while a pre-disambiguation Mooney image is well separable from the same image presented after disambiguation or its matching grayscale image (Figure 5C). These results reveal an image-specific shift in neural representation toward the relevant prior experience.

In a complementary analysis, we quantified the strength of the diagonals of between-condition squares (referred to as “off-diagonal elements for the same image” by reviewer #4) as compared to the off-diagonal elements in the same between-condition squares, again using the “correlation (within-between)” measure (Figure 5D). This analysis quantifies the similarity between the neural representation of the same/matching image in different perceptual conditions above and beyond its similarity to other images (e.g., is post-Image-A represented more similarly to pre-Image-A than to pre-Image-B?). The results, presented in Figure 5D, show that neural activities reflect similarities of stimulus features in an early (<500 ms) time window and similarities of recognition content in a late (>500 ms) time period.
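The diagonal-versus-off-diagonal comparison described above can be sketched for a single time point as follows. This is a simplified illustration under stated assumptions: it operates on one pattern per image per condition, whereas the analysis in the manuscript was carried out at the single-trial level, and the function name `within_minus_between` and the array layout are hypothetical.

```python
import numpy as np

def within_minus_between(cond_a, cond_b):
    """Correlation (within - between) for one between-condition square.

    cond_a, cond_b: arrays of shape (n_images, n_features); row i of each
    array is the pattern for the same image in two perceptual conditions.
    """
    n = cond_a.shape[0]
    # Cross-condition block of the correlation matrix (rows: A, columns: B)
    corr = np.corrcoef(cond_a, cond_b)[:n, n:]
    within = np.mean(np.diag(corr))                # same image across conditions
    between = corr[~np.eye(n, dtype=bool)].mean()  # different images across conditions
    return float(within - between)
```

A positive value means that an image in one condition is represented more similarly to its matching image in the other condition than to non-matching images; a value near zero means matching images are no more similar than non-matching ones.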

We would like to refer the editors and reviewers to the newly added text in the Results (subsection “Comparing image-specific dynamic neural representations across perceptual conditions at the single-trial level”), Discussion (fifth paragraph), Materials and methods (subsection “Single-trial separability”), and the new Figure 5 for further details. We believe that these new observations provide strong evidence for content-specific neural effects – encoding of stimulus input in the early (<300 ms) time period and of recognition content in the late (>500 ms) time period, which further strengthens our main conclusions.

2) Figure 4E shows a significantly positive correlation between the representational similarity structure for recognized Mooney images and unambiguous images, but not that this effect is greater for post vs. pre Mooney images. Therefore, this effect cannot be conclusively related to Mooney image disambiguation.

In the analysis shown in Figure 4E, we had indeed probed the correlation between the representational geometry of pre-disambiguation Mooney images and grayscale images. As we mentioned in the text, “Correlations of Pre-Pre and Gray-Gray squares of the RDM were not significant at any time point.” Since no significance was found, we chose not to include this trace in Figure 4E to avoid cluttering the figure (the full figure is included as Author response image 1). This result contrasts with the correlations between Post-Post and Gray-Gray squares, which yielded sustained significant (p < 0.05, cluster-based permutation test) clusters after 500 ms (Figure 4E, magenta). Likely due to insufficient statistical power, a direct contrast between r(Pre-Pre, Gray-Gray) and r(Post-Post, Gray-Gray) did not yield significant clusters following correction for multiple comparisons using a cluster-based permutation test. Nonetheless, we believe that this analysis is valuable, given the sustained significance in the late (>500 ms) time period for the recognition-based representation (correlating representational geometry between post-Mooney and grayscale images), and the lack of significance for the control analysis (correlating representational geometry between pre-Mooney and grayscale images).

Author response image 1
Same as Figure 4E, now showing element-wise correlations between the Pre-Pre and Gray-Gray triangles in the RDM as the black dashed line.

No significant cluster was found for this comparison at a level of p < 0.05 (cluster-based permutation test).

In sum, we believe that our results, including the new addition, present strong evidence for content-specific neural effects related to stimulus processing and subjective recognition during prior-guided visual perception, and reveal their respective time courses.

Reviewer #1:

This paper is on perceptual processing with respect to prior knowledge. They use Mooney images (binary images without recognizable content), which, after one has seen the underlying grey scale image, are easily identified. This is a very powerful perceptual effect and allows the investigation of how prior information affects recognition. The paper uses MEG (and fMRI) in combination with a series of clever time resolved decoding approaches to show "time-courses of dissociation".

In addition they used representational similarity analysis (RSA) to show the time-course of similarities between pre and post (same physical stimulus) and post and gray (same percept i.e. recognition). Not too surprisingly, they show that these time resolved similarities differ, with the stimulus based similarities peaking early and the recognition based similarities peaking later.

Finally, they employ a powerful model based RSA approach where they investigate the commonalities of RSA based on MEG (as before), fMRI and a theoretically predicted RSA (i.e. a model). The model can incorporate recognition (high similarity between post and gray) etc. Importantly, by looking for commonalities across MEG and fMRI, they can, based on pattern similarity, fuse fMRI and MEG. Although clever and informative, a similar approach has already been published (visual object recognition) (Cichy et al., 2014).

The contribution of the present study is not in methodology development, but in using recently developed methods for probing multivariate neural representations in dynamic, whole-head MEG signals and merging neural data across modalities at an informational level (e.g., Kriegeskorte et al., 2008; Cichy et al., 2014; Hebart et al., 2018; Guggenmos et al., 2018) to reveal neural mechanisms underlying prior knowledge’s influence on visual perception and recognition. Previous studies applying these methods have used images depicting clear, high-contrast, isolated objects, where the prior knowledge invoked by the images (e.g., the knowledge of cows as an animal category) was solidified typically decades ago during development. By contrast, our paradigm allows the establishment of a prior knowledge de novo in an extremely fast and robust manner; this allows us to probe and contrast visual perception/recognition without vs. with prior knowledge and reveal the time courses of the involved neural computations. The distinction of our results from previous findings is also underscored by different latencies of the identified neural effects: we find recognition-related neural effects with temporal latencies (>500 ms following stimulus onset) much later than most previously reported neural effects related to object recognition using MEG (typically within 500 ms). This difference and the related considerations were addressed in the Introduction (third paragraph) and Discussion (sixth paragraph). We now better explain our topic of investigation and its broader significance in the opening paragraph of Discussion:

“Despite the pervasive need to resolve stimulus ambiguity (caused by occlusion, clutter, shading, and inherent complexities of natural objects) in natural vision (Olshausen and Field, 2005) and the enormous power that prior knowledge acquired through past experiences wields in shaping perception (Helmholtz, 1924; Albright, 2012), the neural mechanisms underlying prior-guided visual recognition remain mysterious. Here, we exploited a dramatic visual phenomenon, where a single exposure to a clear, unambiguous image greatly facilitates recognition of a related degraded image, to shed light on dynamical neural mechanisms that allow past experiences to guide recognition of impoverished sensory input.”

Although the presented data are very interesting and show what can be done with clever multivariate methods, including model-based RSA analyses, the promise of the title "Neural dynamics of visual ambiguity resolution by perceptual prior" is not fulfilled by this paper. Potentially, this data could give us some insights on how the integration of prior and incoming visual information works. This is only vaguely addressed, e.g. by data shown in Figure 3C.

In addition, one could argue that the novelty of this paper is only incremental: In a previous paper by Cichy et al., 2014, and a subsequent paper by Hebart describing a similar approach using model based RSA (Hebart et al., 2018) similar results were obtained. They studied object recognition, which is also based on prior information (volunteers know the objects and have seen them before in a different manner), although there is no control condition (i.e. identical visual stimulus, but different percept) as in a Mooney faces experiment.

The current paper should either provide more information about the neural dynamics of visual ambiguity resolution or at least explain how their approach adds novel insights over and above the papers mentioned above.

We hope that the above responses to the reviewer’s overall assessment and the editors’ comments have sufficiently addressed these concerns.

Reviewer #2:

[…] The methodology is without doubt advanced and the general question that this study is supposed to address is of wide general interest. However, my first main concern is that the authors do not introduce a specific hypothesis nor outline exactly how this research is going to bring us closer to understanding how experience guides recognition. As a result, the study, although being informative, comes across as a "fishing expedition".

We respectfully disagree with this characterization of our study. As we presented in Introduction, the current study indeed tests a specific hypothesis:

“Previous neuroimaging studies have observed that disambiguation of Mooney images induces widespread activation and enhanced image-specific information in both visual and frontoparietal cortices. […] Together with a recent finding of altered content-specific neural representations in frontoparietal regions following Mooney image disambiguation, these observations raise the intriguing possibility that slow, long-distance recurrent neural dynamics involving large-scale brain networks are necessary for prior-experience-guided visual recognition.”

Another major concern is that the authors do not report statistically solid univariate findings, which makes it impossible to relate the findings reported here to previous imaging studies employing a similar paradigm. The authors do present SVM weight-maps. However, these maps are anecdotal at best as they are not statistically evaluated in any way. Reporting univariate fMRI data would also be extremely valuable, as it would for example enable readers to assess how the MEG-fMRI modeling results relate to fMRI response amplitude (and SNR).

Statistically solid univariate findings using the fMRI data set have already been reported in a previous publication (Gonzalez-Garcia et al., 2018). Since the strength of MEG is in the temporal domain, not the spatial domain, and due to volume conduction, we believe that a massive univariate analysis using the MEG data set in the context of the present study would be superfluous. This is also not in line with most recent studies employing multivariate analyses applied to MEG data (e.g., Carlson et al., 2013; Cichy et al., 2014; Hebart et al. 2018, referenced by reviewers #1 and #2).

Furthermore, the authors make a claim that is not fully supported by their findings: they state that "This analysis showed that image-specific information for post-disambiguation Mooney images rises higher than their pre-disambiguation counterparts starting from ~500 ms" based on finding iii. This is misleading, because greater between-image pattern distances do not directly imply greater stimulus information. This finding could, for example, also be explained by noisier responses for recognized Mooney images.

Please see our response to the editor’s point #1 above.

Another issue is that a crucial test is missing related to finding iv (Figure 4E). This finding implies that recognizing Mooney images causes representational geometry (MEG based) to become more similar to that for the corresponding set of gray-scale images. However, the authors need to demonstrate that this increase in RDM-RDM similarity is significantly greater for the Post RDM as compared to the Pre RDM.

Please see our response to the editor’s point #2 above.

Finally, I don't see why finding ii is of interest (Figure 3D). To me it is unclear what sets this case of (MEG) pattern information persistence apart from previous reports of this phenomenon (e.g. Carlson et al., 2013), and how it functionally relates to experience-driven recognition.

Carlson et al. 2013 is no doubt a classic in the literature. However, it addresses a distinct question from the current study: dynamical neural mechanisms underlying recognition of clear, high-contrast color images of isolated objects (Carlson study) vs. dynamical neural mechanisms underlying experience-guided recognition of degraded, black-and-white images of objects embedded in scenes, where recognition without prior experience is extremely difficult (our study). In addition, the cross-decoding result in Carlson et al. (Figure 6A therein) did not show “pattern information persistence”, but rather transient effects that were fast changing over time – as shown by the diagonal pattern of significant decoding which contrasts with the rectangular pattern in our Figure 3D. (But again, these two analyses are asking very different questions in two studies that have different aims.)

What this analysis (Figure 3D) shows is that neural activity patterns distinguishing perceptual stage (pre- vs. post-disambiguation) are relatively sustained over time. Although this analysis is not content-specific (as we clearly acknowledge in the manuscript), it sets the stage for the content-specific analyses and results presented thereafter.

Given these issues, I do not recommend publication of this manuscript in its current state.

We hope that we have satisfactorily addressed the reviewer’s concerns.

Reviewer #3:

[…] At the same time, I believe the authors make some claims not supported by the data. They highlight that part of the novelty of their work has to do with the fact that previous work on this topic did not reveal image-specific results and, indeed, the authors do report image-specific findings in Figure 4E. However, in contrast to the authors' claim, the other effects using RSA are likely not stimulus-specific. For example, the results in Figure 4D are averaged across stimuli, leading to condition-specific effects. To achieve stimulus-specific effects, the authors would have to either identify the similarity for the same stimulus to itself or identify the difference between same stimulus and different stimulus within different periods of the experiment. They could do this by carrying out a split-half analysis and calculating the difference (within – between). This would be equivalent to a stimulus-specific decoding analysis. I think this kind of analysis would be useful to support their results. Alternatively, the authors may want to adjust this description of their results with respect to stimulus-specificity in the Materials and methods, Results, and Discussion.

Please see our response to the editors’ point #1 above. We are deeply grateful to the reviewer for suggesting the split-half analysis using the “correlation (within-between)” metric, the results of which are included in the newly added Figure 5 and described in a newly added Results subsection “Comparing image-specific dynamic neural representations across perceptual conditions at the single-trial level”. Related methods are described in Materials and methods, subsection “Single-trial separability”. Although we did not use a split-half analysis exactly, we performed such an analysis at the single-trial level to calculate the “correlation (within – between)” metric.

We also note that we think the results in Figure 4D were indeed image-specific instead of condition-specific. This is because the time courses in Figure 4D were averaged across individual image-pairs within each perceptual condition, where each value quantifies the dissimilarity between neural activity patterns related to those two individual images. Thus, this analysis shows that neural activity patterns elicited by post-disambiguation Mooney images are more distinct from each other than those elicited by the same images presented pre-disambiguation. In a control analysis (shown in Figure 4—figure supplement 2, now better explained in Results, subsection “Disambiguation increases across-image dissimilarity in neural dynamics”, and Materials and methods, subsection “Euclidean distance”), the results in Figure 4D were reproduced using cross-validated Euclidean distance, which receives contributions only from signal components that are consistent across separate partitions of the data (hence suppressing the contribution of random noise). Nonetheless, since this analysis was not conducted at the level of single trials, we have now removed the term “image-specific information” when describing this analysis (in the Abstract, title and concluding paragraph in the corresponding Results section).

A similar argument could be made regarding the model-based MEG-fMRI fusion results. The stimulus-specific model focuses on gross differences between Mooney images and greyscale images, rather than individual images. The recognition-specific model assumes that images post-recognition all become different from each other, which would lead to high dissimilarity. However, in line with the authors' interpretation of their prior work (Gonzalez-Garcia et al., 2018), one could also argue that they should become more similar to each other (when treated as the class of objects rather than individual images). In addition, their model interpretation would assume that the image itself should at least become more similar to itself, i.e. according to their interpretation, in my understanding the model would have to contain off-diagonal elements for the same image between grayscale and post-recognition periods.

We do not understand the reviewer’s comment “in line with the authors' interpretation of their prior work (Gonzalez-Garcia et al., 2018), one could also argue that they should become more similar to each other (when treated as the class of objects rather than individual images).” In this previous paper, we showed that dissimilarity between neural representation of individual images increases substantially after disambiguation, i.e., they became more different from each other. This was shown in several ways in that paper: i) overall redder hues of the Post-Post square than the Pre-Pre square of the RDMs (Figure 3B and 4A); ii) larger distances between dots representing individual images in the Post stage than the Pre stage in the multidimensional scaling (MDS) plots (Figure 3C and 4B); iii) statistical summary results (Figure 3D and 4C, cyan brackets).

The “off-diagonal elements for the same image between grayscale and post-recognition periods” (i.e., “the diagonal elements of between-condition squares” in our terminology, which we think is more accurate) have been systematically and comprehensively probed in the analysis described in the new Figure 5 (see our reply to the previous comment). This analysis quantifies these diagonals and compares them between each other (Figure 5C) and to the off-diagonal elements in the same between-condition squares (Figure 5D) using the empirical MEG data. We think this is a superior approach to including such diagonals in the between-condition squares of the model RDM. This is because the models were designed to probe relatively coarse effects (as we acknowledge in the manuscript), and it would be hard to know what arbitrary value to set such between-condition diagonals within each model, which contained binary values capturing coarser effects. All models are intended to capture certain aspects in the data (“all models are wrong, some are useful”). In fact, we think that it is wonderful (but not an a priori given) that this model-based fMRI-MEG fusion analysis probing relatively coarse effects yielded stimulus- and recognition-related neural activity time courses that are very much consistent with the earlier content-specific analyses applied to the MEG data alone (Figures 4 and 5). But, of course, the model-based fusion analysis provided further insight into the spatial dimension by bringing in the fMRI data.

To strengthen their conclusions, I would suggest the addition of stimulus-specific or at least category-specific (e.g. animate – inanimate) decoding analyses. Further, I would suggest carrying out a category-specific analysis (e.g. animate – inanimate) to confirm the claims that the results are indeed recognition-related.

We hope that we have satisfactorily addressed the concerns regarding stimulus-specific effects (summarized in our response to the editors’ point #1). We strongly believe that the current results provide very clear and very robust content-specific recognition-related effects: for example, we show that a post-disambiguation Mooney image is indistinguishable at the single-trial level from its matching grayscale image from ~500 ms onward (Figure 5C, dark blue) but well separable from other grayscale images (Figure 5D, dark blue), while in the same time period a pre-disambiguation Mooney image is well separable at the single-trial level from the same image presented post-disambiguation or from its matching grayscale image (Figure 5C, orange and green). These results reveal an image-specific shift in neural representation toward the relevant prior experience that guides perception. Furthermore, our previously included control analysis for Figure 4D, using cross-validated Euclidean distance (Figure 4—figure supplement 2), demonstrated that the increase in between-image dissimilarity following disambiguation was not driven by changing levels of noise in the data.

Given that we have shown clear content-specific effects at the level of individual images, we respectfully think that a category-level analysis is beyond the scope of this study.

While, as mentioned above, the addition of a control analysis is great, it only makes up a fraction of the other conditions. Therefore, the absence of decoding or RSA effects may be due to reduced power. What would the equivalent analysis look like for the experimental data if it is similarly reduced in size?

We have performed a control analysis for Figure 3A-B which matches the statistical power between real and catch image sets. Since there were only 6 catch image sets, we randomly selected 6 real image sets and re-conducted decoding of presentation stage (Pre- vs. Post-disambiguation). After matching the statistical power of catch image sets, we still obtained significant decoding of presentation stage using the SCP band, as shown in Author response image 2A. Qualitatively similar results were obtained using the ERF band (Author response image 2B). Because of the computational cost, we did not perform a cluster-based permutation test for this analysis.

Author response image 2
Same as Figure 3A-B (black and green traces), except that 6 randomly selected real image sets were used to match the statistical power of catch image sets.

Results from 10 such randomly selected subsets were averaged together, and mean and s.e.m. of decoding accuracy across subjects are plotted for both real (green) and catch (black) image sets. Horizontal bar indicates significant difference from chance level (p < 0.05, FDR corrected). Catch results are identical to those in Figure 3A-B.

Since changes in statistical power would only systematically affect decoding accuracy (such as in Figure 3A-B), but not estimation of the mean (such as in Figure 4D and Figure 4—figure supplement 1), we did not perform a similar control analysis for Figure 4D. In other words, if we select 6 random real image sets and re-compute Figure 4D, we would not expect any systematic change in the result.
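The subsampling procedure described above (drawing 6 real image sets at random, repeating this 10 times, and averaging the results) can be sketched as follows. This is a minimal illustration under stated assumptions: `matched_subset_scores` and `score_fn` are hypothetical names, the 30 per-set values are made up, and `score_fn` stands in for whatever analysis is re-run on each subset (a decoding pipeline for Figure 3A-B; here, simply a mean).

```python
import random
from statistics import mean


def matched_subset_scores(real_sets, n_catch, score_fn, n_draws=10, seed=0):
    """Re-run an analysis on random subsets of real image sets.

    Each draw selects n_catch real image sets without replacement, so the
    amount of data matches that of the catch image sets. Returns the score
    averaged across draws, together with the drawn subsets.
    """
    rng = random.Random(seed)
    draws = [rng.sample(real_sets, n_catch) for _ in range(n_draws)]
    scores = [score_fn(subset) for subset in draws]
    return mean(scores), draws


# Hypothetical per-image-set values (made-up numbers, not study data).
real_sets = [0.40 + 0.01 * i for i in range(30)]

# With a mean as the score, subsampling changes the variability of the
# estimate but not its expected value.
avg_score, draws = matched_subset_scores(real_sets, n_catch=6, score_fn=mean)
```

With a mean as the score, each subset yields a noisier but unbiased estimate of the full-sample mean, which is the reason no systematic change is expected for Figure 4D.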

[Editors' note: further revisions were requested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below: As you will see below, there are remaining issues with Figure 4E, as outlined in the comment by reviewer #2. The new Figure 5 needs further clarification for the reader, and there is one outstanding issue related to how exemplar-specific (i.e., image-specific) these findings are. Finally, the model needs either adjustment or a justification in the Discussion section (see comment by reviewer #3).

We are grateful to the editors and reviewers for the favorable evaluation of our previous revision and the additional helpful suggestions. We have thoroughly further revised the manuscript in line with the editors’ and reviewers’ comments. Major changes include:

- We have significantly toned down the interpretations and conclusions derived from Figure 4E, and now specifically state that this analysis only provides qualitative evidence (albeit from a unique angle and yielding findings that are consistent with all the other analyses), which is quantitatively assessed by the ensuing model-driven data fusion analysis.

- In response to both reviewers’ comments suggesting that Figure 4D leads naturally to Figure 5, we have now swapped the order of the Results sections related to Figure 4E and Figure 5, such that the revised text describes results in the following sequence: Figure 4D → Figure 5 → Figure 4E → Figure 6. Although unconventional in terms of figure sequence, we think that this order fits better with the logical flow of the analyses, such that the questions opened up by Figure 4D are answered by Figure 5, and the qualitative evidence provided by Figure 4E is strengthened by Figure 6.

- We have revised the models in line with reviewer #3’s suggestions. The results obtained with these updated models are consistent with our previous findings but show stronger neural effects. As a result, Figure 6 as well as Figure 6—figure supplement 1 have been updated, and the related text has been thoroughly revised.

We believe that our revision has fully addressed all of the editors’ and reviewers’ remaining concerns and the manuscript has been further strengthened as a result. Please find below a point-by-point response to the editors’ and reviewers’ comments.

Reviewer #2:

I am glad to see that the authors have found an elegant way of addressing my most important issue – the lack of direct support for the claim of recognition driven content-specific effects. I think that this issue is now covered by the results shown in Figure 5.

We are pleased that the reviewer found the analysis reported in Figure 5 satisfactory.

I still would like to insist on a revision of Figure 4E. The pink line in the figure is referred to as a "recognition-based" time course. This is misleading as the difference between the pink and the green line can be explained by differences in representational geometry due to recognition and due to differences related to presenting Gray vs. Mooney images. Hence, there is no (significant) evidence for this effect being more pronounced for the Post-Gray than the Pre-Gray comparison. Therefore, the authors should explicitly clarify that the difference shown in this figure cannot be conclusively attributed to recognition. They could also simply omit 4E as the content-specific recognition effect is now directly demonstrated in Figure 5. The authors could also consider conceptually linking Figures 4 and 5 by clarifying that Figure 4D leaves open the question of whether recognition-driven enhanced (and equal-to Gray) representational dissimilarity for Mooney images is driven by activation patterns to Mooney images becoming more similar to their Grayscale counterparts. At present, a compelling motivation for the Figure 5 analysis is missing at the start of the corresponding Results section.

We have now significantly toned down and qualified the interpretations of Figure 4E, and present it as providing qualitative evidence consistent with the other analyses. Below we reproduce the most relevant text:

“As a control measure, the correlation between Pre-Pre and Gray-Gray squares of the RDM was not significant at any time point, suggesting that, as expected, representational geometry is different between conditions with different stimulus input and different recognition outcomes. […] In the final analysis presented below, we will quantitatively test this possibility using a model-driven MEG-fMRI fusion analysis that simultaneously elucidates the spatial dimension of the evolving neural dynamics.”

As mentioned above, we have also moved the Results section related to Figure 4E later, to provide a smoother and more direct transition between Figure 4D and Figure 5:

“This dramatic effect raises two important questions: 1) Is this effect driven by the neural representations of Mooney images shifting towards those of their respective grayscale counterparts? […] To answer these questions, we next probe how neural representation for a particular Mooney image changes following disambiguation at the single-trial level.”

We believe that Figure 4E is a valuable analysis to retain in the manuscript, since it is the only analysis that probes how the fine-grained representational geometry (the set of representational distances across all image pairs) compares between conditions. It shows that the representational geometry is significantly similar between Pre and Post conditions in an early time window (<300 ms), and between Post and Gray conditions in a late time window (>600 ms). In addition, the control analysis of comparing between Pre and Gray conditions did not yield any significant time point, as expected. We acknowledge that the result is not as strong as one would like, since a direct contrast between Post-Gray and Pre-Gray comparisons did not yield significance after correcting for multiple comparisons using a cluster-based permutation test. Given these considerations, we have opted to leave the analysis in and present it as qualitative evidence.

With respect to my introduction-related comment, I still think that the clarity of the paper would be enhanced by including a more specific hypothesis. As noted in the rebuttal, this presently boils down to this sentence: "these observations raise the intriguing possibility that slow, long-distance recurrent neural dynamics involving large-scale brain networks are necessary for prior-experience-guided visual recognition." I find it hard to see how this rather broad and vague hypothesis is a natural motivation for the specific research performed during this study or how it is precisely addressed by the findings.

In the sentence quoted by the reviewer, we have now clarified that “slow” refers to “taking longer than 500 ms”. We believe that this is actually a very specific hypothesis, given that previous studies have typically reported recognition-related neural activity that concludes within 500 ms, as we previously stated in the same paragraph (reproduced below):

“By contrast, neural dynamics underlying recognition of intact, unambiguous images, as well as scene-facilitation of object recognition, typically conclude within 500 ms (Carlson et al., 2013; van de Nieuwenhuijzen et al., 2013; Kaiser et al., 2016; Brandman and Peelen, 2017). Together with a recent finding of altered content-specific neural representations in frontoparietal regions following Mooney image disambiguation (Gonzalez-Garcia et al., 2018), these observations raise the intriguing possibility that slow (taking longer than 500 ms), long-distance recurrent neural dynamics involving large-scale brain networks are necessary for prior-experience-guided visual recognition.”

We also note that not all valuable research derives from testing specific hypotheses, and data-driven analyses are equally important for uncovering behaviorally relevant patterns in large, complex neural data sets without a priori biases. Some of our analyses may lie between hypothesis-driven and data-driven extremes, as we note in the Introduction:

“To unravel neural mechanisms underlying prior experience’s influence on perception, an important unanswered question is how different information processing stages are dynamically encoded in neural activities.”

Reviewer #3:

I would like to thank the authors for taking up many of my suggestions. I believe the clarifications in the text and the addition of the novel stimulus-specific analyses greatly strengthened the paper and the conclusions that can be drawn from the results. Nevertheless, I have some remaining reservations.

It is correct that the analyses in Figure 4D were carried out at the image level. Nevertheless, in my understanding it is impossible to tell whether these effects are image-specific. This is a small correction to my original assessment where I said these results were condition-specific, while in fact they simply do not allow distinguishing between condition-specific and image-specific effects. The goal of this analysis is to show differences in the image-specific effects between the three conditions. Using the dissimilarity matrices in 4B and the analyses of the authors in 4C, then indeed the expected dissimilarity matrix for image-specific effects would exhibit low dissimilarity everywhere within a condition. This dissimilarity would then be expected to be different for each condition and change across time. This is the result the authors showed in 4D. However, for condition-specific effects the expected dissimilarity matrix would show low dissimilarity within condition, as well. The only way to tell these apart is by comparing the dissimilarity within image to the dissimilarity between. Since the authors already conducted this analysis, it is just a matter of clarification that the results in 4D cannot distinguish between condition-specific and image-specific results.

We agree. As mentioned above, we have now re-ordered the text sections related to Figure 4E and Figure 5, such that the presentation of Figure 4D is followed by Figure 5 in the Results section. We have further added a paragraph at the end of the Figure 4D section (subsection “Disambiguation increases across-image dissimilarity in neural dynamics”) to discuss the limitations of this analysis and provide a better transition to the analysis presented in Figure 5.

Regarding the MEG-fMRI fusion modeling, I agree with the authors that the model is a choice the author has to make, but it has to be both justified and consistent. For the former, perhaps it would make sense to spell out the construction of the models in some more detail. For the latter, if the authors do not want to make adjustments, I would suggest discussing those limitations. The "recognition" model assumes that (1) Mooney images in the Pre-phase are not recognized and are all similarly "unrecognized", (2) all recognized stimuli are different from each other, and (3) all recognized stimuli are as different from each other as they are from unrecognized images. I can follow (1). However, as I mentioned, for (2) according to the authors' interpretation the "diagonals of between-condition squares" should be similar to each other. Without this, the model is inconsistent. The authors argued they cannot predict the exact value expected for those cells. However, since they are using the Spearman correlation, they would just need to choose if the dissimilarity is lower than within Pre (i.e. the blue square), the same, or higher. If the authors cannot decide, they could leave out those cells and remain agnostic. Note that, however, the same issue of comparability arises regarding (3): it seems like an even stronger assumption that in terms of recognition all recognized images are as different from each other as they are from unrecognized images. The authors should either be explicit about this, adjust the model, or remove those cells from the analysis.

We would like to thank the reviewer for the very helpful suggestion. Both the consideration about Spearman correlation (only the ordering of values, not their absolute values, matters) and the strategy of excluding cells where the model is agnostic were great suggestions. We have now updated both the Stimulus and the Recognition model to address the points raised by the reviewer (specifically, 2 and 3). The new results are qualitatively consistent with what we presented in the previous submission, but show stronger neural effects related to stimulus processing. Given the extensiveness of changes to the text, we do not reproduce the revised text here, but would like to refer the reviewer to the relevant Results subsection “Model-driven MEG-fMRI data fusion spatiotemporally resolves neural dynamics related to stimulus, attention, and recognition processing”, as well as the revised Figure 6.
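The two points raised by the reviewer (Spearman correlation depends only on the ordering of model values, and cells where the model is agnostic can be excluded) can be illustrated with a short sketch. This is not the analysis code used in the study: the function names, the use of `None` to mark agnostic cells, and the 4-condition example values are assumptions made for illustration.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def masked_spearman(model_rdm, data_rdm):
    """Spearman correlation over upper-triangle cells where the model is defined.

    Cells set to None in model_rdm are treated as agnostic and skipped.
    """
    n = len(model_rdm)
    pairs = [
        (model_rdm[i][j], data_rdm[i][j])
        for i, j in combinations(range(n), 2)
        if model_rdm[i][j] is not None
    ]
    model_vals, data_vals = zip(*pairs)
    return pearson(average_ranks(model_vals), average_ranks(data_vals))


# Hypothetical 4-condition example (made-up numbers, not study data).
model_rdm = [
    [0, 0, 1, 1],
    [0, 0, 1, None],  # the model makes no prediction for cell (1, 3)
    [1, 1, 0, 0],
    [1, None, 0, 0],
]
data_rdm = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 5.0],
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 5.0, 0.1, 0.0],
]
rho = masked_spearman(model_rdm, data_rdm)

# Rescaling the model values leaves the rank correlation unchanged.
rescaled = [[None if v is None else v * 10 for v in row] for row in model_rdm]
rho_rescaled = masked_spearman(rescaled, data_rdm)
```

Because only ranks enter the correlation, the model needs to specify which cells are expected to be more or less dissimilar, not by how much, and an excluded cell has no influence on the result whatever the data value at that cell.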

https://doi.org/10.7554/eLife.41861.015

Article and author information

Author details

  1. Matthew W Flounders

    Neuroscience Institute, New York University Langone Medical Center, New York, United States
    Contribution
    Conceptualization, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-7014-4665
  2. Carlos González-García

    Department of Experimental Psychology, Ghent University, Ghent, Belgium
    Contribution
    Formal analysis, Funding acquisition, Investigation, Visualization
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-6627-5777
  3. Richard Hardstone

    Neuroscience Institute, New York University Langone Medical Center, New York, United States
    Contribution
    Software, Formal analysis, Validation, Visualization, Methodology, Writing—original draft
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-7502-9145
  4. Biyu J He

    1. Neuroscience Institute, New York University Langone Medical Center, New York, United States
    2. Department of Neurology, New York University Langone Medical Center, New York, United States
    3. Department of Neuroscience and Physiology, New York University Langone Medical Center, New York, United States
    4. Department of Radiology, New York University Langone Medical Center, New York, United States
    Contribution
    Conceptualization, Resources, Data curation, Supervision, Funding acquisition, Validation, Methodology, Project administration, Writing—review and editing
    For correspondence
    biyu.jade.he@gmail.com
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-1549-1351

Funding

National Institute of Neurological Disorders and Stroke

  • Biyu J He

Klingenstein-Simons Neuroscience Fellowship

  • Biyu J He

U.S. Department of State (The Fulbright Program)

  • Carlos González-García

National Science Foundation (BCS-1753218)

  • Biyu J He

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This research was supported by the Intramural Research Program of the National Institutes of Health/National Institute of Neurological Disorders and Stroke, and National Science Foundation (BCS-1753218, to BJH). BJH further acknowledges support by Klingenstein-Simons Neuroscience Fellowship. CGG was supported by the Department of State Fulbright program. We thank Brian Maniscalco and Tom Holroyd for helpful discussions on code implementation and data acquisition, respectively.

Ethics

Human subjects: The experiment was approved by the Institutional Review Board of the National Institute of Neurological Disorders and Stroke (under protocol #14-N-0002). All subjects provided written informed consent.

Senior Editor

  1. Michael J Frank, Brown University, United States

Reviewing Editor

  1. Christian Büchel, University Medical Center Hamburg-Eppendorf, Germany

Reviewer

  1. Christian Büchel, University Medical Center Hamburg-Eppendorf, Germany

Publication history

  1. Received: September 9, 2018
  2. Accepted: February 25, 2019
  3. Accepted Manuscript published: March 7, 2019 (version 1)
  4. Version of Record published: March 13, 2019 (version 2)

Copyright

© 2019, Flounders et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,548
    Page views
  • 241
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
