
Recurrent processes support a cascade of hierarchical decisions

  1. Laura Gwilliams (corresponding author)
  2. Jean-Remi King
  1. Department of Psychology, New York University, United States
  2. NYU Abu Dhabi Institute, United Arab Emirates
  3. Frankfurt Institute for Advanced Studies, Germany
  4. Laboratoire des Systèmes Perceptifs (CNRS UMR 8248), Département d’Études Cognitives, École Normale Supérieure, PSL University, France
Research Article
Cite this article as: eLife 2020;9:e56603 doi: 10.7554/eLife.56603

Abstract

Perception depends on a complex interplay between feedforward and recurrent processing. Yet, while the former has been extensively characterized, the computational organization of the latter remains largely unknown. Here, we use magneto-encephalography to localize, track and decode the feedforward and recurrent processes of reading, as elicited by letters and digits whose level of ambiguity was parametrically manipulated. We first confirm that a feedforward response propagates through the ventral and dorsal pathways within the first 200 ms. The subsequent activity is distributed across temporal, parietal and prefrontal cortices, which sequentially generate five levels of representations culminating in action-specific motor signals. Our decoding analyses reveal that both the content and the timing of these brain responses are best explained by a hierarchy of recurrent neural assemblies, which both maintain and broadcast increasingly rich representations. Together, these results show how recurrent processes generate, over extended time periods, a cascade of decisions that ultimately accounts for subjects’ perceptual reports and reaction times.

Introduction

To process the rich sensory flow emanating from the retina, the brain recruits a hierarchical network originating in the primary visual areas and culminating in the infero-temporal, dorso-parietal and prefrontal cortices (Hubel and Wiesel, 1962; Maunsell and van Essen, 1983; Riesenhuber and Poggio, 1999; DiCarlo et al., 2012).

In theory, the feedforward recruitment of this neural hierarchy could suffice to explain our ability to recognize visual objects. For example, recent studies demonstrate that artificial feedforward neural networks trained to categorize objects generate activation patterns similar to those elicited in the infero-temporal cortices (Yamins et al., 2014; Schrimpf et al., 2019; Khaligh-Razavi and Kriegeskorte, 2014; Cichy et al., 2016).

However, feedforward architectures have a fixed number of processing stages and are thus unable to explain a number of neural and perceptual phenomena. For example, the time it takes subjects to recognize objects considerably varies from one trial to the next (Ratcliff and Smith, 2004). In addition, the neural responses to visual stimuli generally exceed the 200 ms feedforward recruitment of the visual hierarchy (Dehaene and Changeux, 2011a; Lamme and Roelfsema, 2000).

A large body of research shows that recurrent processing accounts for such behavioral and neural dynamics (Lamme and Roelfsema, 2000; Gray et al., 1989; Gold and Shadlen, 2007; Shadlen and Newsome, 2001; O'Connell et al., 2012; Spoerer et al., 2017; Kar et al., 2019; Kietzmann et al., 2019; Spoerer et al., 2019). In this view, recurrent processing would mainly consist of accumulating sensory evidence until a decision to act is triggered (O'Connell et al., 2012; Mohsenzadeh et al., 2018; Rajaei et al., 2019).

However, the precise neuronal and computational organization of recurrent processing remains unclear at the system level. In particular, how distinct recurrent assemblies implement series of hierarchical decisions remains a major unknown.

To address this issue, we use magneto-encephalography (MEG) and structural magnetic-resonance imaging (MRI) to localize, track and decode, from whole-brain activity, the feedforward (0–200 ms) and recurrent processes (>200 ms) elicited by variably ambiguous characters briefly flashed on a computer screen. We show that the late and sustained neural activity distributed along the visual pathways generates, over extended time periods, a cascade of categorical decisions that ultimately predicts subjects’ perceptual reports.

Results

Subjective reports of stimulus identity are categorical

To investigate the brain and computational bases of perceptual recognition, we used visual characters as described in King and Dehaene, 2014a. These stimuli can be parametrically morphed between specific letters and digits by varying the contrast of their individual edges, hereafter referred to as pixels (Figure 1A–B).

Experimental protocol and behavioral results.

Experiment 1: eight human subjects provided perceptual judgments on variably ambiguous digits briefly flashed at the center of a computer screen (A). Reports were made by clicking on a disk, where (i) the radius and (ii) the angle on the disk indicate (i) subjective visibility and (ii) subjective identity respectively. (B) Distribution (areas) and mean (dots) of response locations for each color-coded stimulus. (C) Top plots show the same data as B, broken down for each morph set. The x-axis indicates the expected angle given the stimulus pixels (color-coded), hereafter referred to as evidence. The y-axis indicates the angle of the mean response relative to stimulus evidence. The bottom plot shows the same data, grouped across morphs. (D) Experiment 2: seventeen subjects categorized a briefly flashed and parametrically manipulated morph using a two-alternative forced-choice task. Stimulus-response mapping changed on every block. (E) Mean reaction times as a function of categorical evidence (the extent to which the stimulus objectively corresponds to a letter). (F) Mean probability of reporting a letter as a function of categorical evidence. (G) Evoked activity estimated with dSPM and averaged across all trials and all subjects. These data are also displayed in Video 1. Error-bars indicate the standard-error-of-the-mean (SEM) across subjects.

To check that these stimuli create categorical percepts, we asked eight human subjects to provide continuous subjective reports by clicking on a disk after each stimulus presentation (Experiment 1, Figure 1A). The radius and the angle of the response on this disk indicated the subjective visibility and the subjective identity of the stimulus, respectively. We then compared (i) the reported angle with (ii) the stimulus evidence (i.e. the expected angle given the pixels) for each morph separately (e.g. 5–6, 6–8, etc). Subjective reports were categorical: cross-validated sigmoidal models better predicted subjects’ responses (r = 0.49 ± 0.05, p=0.002) than linear models (r = 0.46 ± 0.043, p=0.002; sigmoid > linear: p=0.017; Figure 1B–C).
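The sigmoid-versus-linear comparison described above can be sketched on simulated data. This is a minimal illustration rather than the analysis code used in the study: the function names (`fit_linear`, `fit_sigmoid`, `cv_correlation`), the grid-search fit, the five-fold split and the simulated slope and noise level are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x, slope, midpoint):
    """Logistic curve rising from 0 to 1 around `midpoint`."""
    return 1.0 / (1.0 + np.exp(-slope * (x - midpoint)))


def fit_linear(x, y):
    """Least-squares line; returns a prediction function."""
    design = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda x_new: coef[0] * x_new + coef[1]


def fit_sigmoid(x, y):
    """Grid search over slope and midpoint (hypothetical grids);
    gain and offset are solved by least squares for each candidate."""
    best_sse, best_predict = np.inf, None
    for slope in np.linspace(1.0, 30.0, 30):
        for midpoint in np.linspace(0.2, 0.8, 13):
            design = np.column_stack([sigmoid(x, slope, midpoint), np.ones_like(x)])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            sse = np.sum((design @ coef - y) ** 2)
            if sse < best_sse:
                best_sse = sse
                best_predict = (lambda x_new, s=slope, m=midpoint, c=coef:
                                c[0] * sigmoid(x_new, s, m) + c[1])
    return best_predict


def cv_correlation(x, y, fit, n_folds=5, seed=0):
    """Correlation between held-out predictions and observed responses."""
    order = np.random.default_rng(seed).permutation(len(x))
    predicted = np.empty(len(y))
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        predicted[test_idx] = fit(x[train_idx], y[train_idx])(x[test_idx])
    return np.corrcoef(predicted, y)[0, 1]


# Simulated categorical observer: responses follow a steep sigmoid of evidence.
rng = np.random.default_rng(1)
evidence = rng.uniform(0.0, 1.0, 400)
reports = sigmoid(evidence, 12.0, 0.5) + rng.normal(0.0, 0.1, 400)

r_linear = cv_correlation(evidence, reports, fit_linear)
r_sigmoid = cv_correlation(evidence, reports, fit_sigmoid)
print(f"linear r = {r_linear:.2f}, sigmoid r = {r_sigmoid:.2f}")
```

Because both models are scored on held-out trials, the extra flexibility of the sigmoid is not rewarded by itself; it wins only when the responses are genuinely categorical.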

We adapted this experimental paradigm for an MEG experiment by modifying three main aspects (Experiment 2). First, we used stimuli that could be morphed between letters and digits, to trigger macroscopically distinguishable brain responses in the visual word form area (VWFA) and number form area (NFA) (Dehaene and Cohen, 2011b; Shum et al., 2013). Second, we added two task-irrelevant flankers next to the target stimulus (Figure 1D) to increase our chances of eliciting recurrent processes via crowding (Strasburger et al., 2011; Pelli et al., 2004). Note that this assumption is based on previous work; we did not test the effect of crowding explicitly in this study. Third, a new set of 17 subjects reported subjective identity via a two-alternative forced-choice button press. The identity-response mapping was orthogonal to the letter/digit category and changed on every block of 48 trials. There were 1920 trials total, 320 of which were presented passively, were not ambiguous and did not require a response.

Perceptual reports followed a similar sigmoidal pattern to Experiment 1: performance was worse for more ambiguous trials (65%) as compared to unambiguous trials (92%, p<0.001). In addition, reaction time slightly, and consistently, increased with uncertainty (i.e. how ambiguous the stimulus is). For example, highly ambiguous stimuli were identified within 690 ms, whereas nonambiguous stimuli were identified within 624 ms (z = −21.68, p<0.001) (Figure 1E–F). Although subjects were asked to respond as quickly as possible, the observed reaction times were overall quite slow, reflecting the difficulty of the task.

Neural representations are functionally organized over time and space

Here, we aimed to decompose the sequence of decisions that allow subjects to transform raw visual input into perceptual reports. To this aim, we localized the MEG signals onto subjects’ structural MRI with dynamic statistical parametric mapping (dSPM, Dale et al., 2000), and morphed these source estimates onto a common brain coordinate (Fischl, 2012; Gramfort et al., 2014). The results confirmed that the stimuli elicited, on average, a sharp response in the primary visual areas around 70 ms, followed by a fast feedforward response along the ventral and dorsal visual pathways within the first 150–200 ms. After 200 ms, the activity appeared sustained and widely distributed across the associative cortices up until 500–600 ms after stimulus onset (Figure 1G and Video 1).

Video 1
Source-localized evoked response averaged over all trials and subjects.

Activity is plotted in noise-normalized dSPM units, and shown on an inflated cortical surface (center) as well as a two-dimensional ‘glass brain’ that shows activity averaged over the transverse plane (bottom right).

To separate the processing stages underlying these neural responses, we applied (i) mass-univariate, (ii) temporal decoding and (iii) spatial decoding analyses based on the five orthogonal features varying in our study: (1) the position of the stimulus, (2) its identity, (3) its perceived category, (4) its uncertainty and (5) its corresponding button press.

Mass-univariate

First, we modeled the source-localized neural responses over time and space as a function of the five predictors of interest (multivariate in feature space, univariate in source space). Regressor beta coefficients were estimated for each subject separately, then submitted to a spatio-temporal cluster test (Figure 2A, details provided in Materials and methods). For stimulus position, two clusters were found: one in the left hemisphere (number of sources = 2562, mean t-value = 2.87, 20–1560 ms, p=0.001) and one in the right (number of sources = 2562, mean t-value = 2.92, 40–1560 ms, p=0.0005). Stimulus identity also elicited two clusters: one in the left hemisphere (number of sources = 1946, mean t-value = 2.54, 100–840 ms, p=0.0005) and one in the right (number of sources = 2118, mean t-value = 2.48, 120–860 ms, p=0.001). No significant clusters were found for decision (i.e. perceived category), but the largest cluster ranged from 210 to 320 ms (mean t-value = 1.79, p=0.21). Uncertainty elicited two clusters: one in the left hemisphere (number of sources = 2485, mean t-value = 2.42, 280–1560 ms, p=0.001) and one in the right (number of sources = 2319, mean t-value = 2.4, 340–1560 ms, p=0.008). Motor side resulted in two clusters: one in the left hemisphere (number of sources = 2523, mean t-value = 2.56, 280–1560 ms, p=0.0005) and one in the right (number of sources = 2525, mean t-value = 2.66, 280–1560 ms, p=0.0005). See Figure 2—figure supplements 4–8 for a full display of these results.
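The first two steps of this analysis (per-subject regression at every source and time sample, then a second-level statistic across subjects) can be sketched as follows. The data are simulated, the names `mass_univariate_betas` and `second_level_t` are hypothetical, and the spatio-temporal cluster correction is deliberately left out: the sketch stops at an uncorrected t-map.

```python
import numpy as np


def mass_univariate_betas(features, activity):
    """features: (n_trials, n_features); activity: (n_trials, n_sources, n_times).
    Fits all features jointly at each source and time sample.
    Returns betas of shape (n_features, n_sources, n_times)."""
    n_trials, n_sources, n_times = activity.shape
    design = np.column_stack([features, np.ones(n_trials)])  # add an intercept
    flat = activity.reshape(n_trials, n_sources * n_times)
    coef, *_ = np.linalg.lstsq(design, flat, rcond=None)
    return coef[:-1].reshape(features.shape[1], n_sources, n_times)


def second_level_t(betas_per_subject):
    """One-sample t-value against zero across subjects (axis 0)."""
    n_subjects = betas_per_subject.shape[0]
    mean = betas_per_subject.mean(axis=0)
    sem = betas_per_subject.std(axis=0, ddof=1) / np.sqrt(n_subjects)
    return mean / sem


rng = np.random.default_rng(0)
n_subjects, n_trials, n_features, n_sources, n_times = 6, 300, 5, 8, 40
all_betas = []
for _ in range(n_subjects):
    features = rng.normal(size=(n_trials, n_features))
    activity = rng.normal(size=(n_trials, n_sources, n_times))
    # Simulated effect: feature 0 drives source 3 between samples 10 and 19.
    activity[:, 3, 10:20] += features[:, [0]]
    all_betas.append(mass_univariate_betas(features, activity))
all_betas = np.stack(all_betas)  # (n_subjects, n_features, n_sources, n_times)

t_map = second_level_t(all_betas)
print("t inside the simulated effect:", round(float(t_map[0, 3, 15]), 1))
```

Reshaping the activity to a two-dimensional array lets a single least-squares call fit every source and time sample at once, which is what makes the approach multivariate in feature space but univariate in source space.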

Figure 2 with 12 supplements
Spatio-temporal hierarchy.

(A) Mass-univariate statistics. Each row plots the average-across-subjects beta coefficients obtained from regression between single-trial evoked activity and each of the five features orthogonally varying in this study. These results are displayed in Video 2. Colors are thresholded based on t-values that exceed an uncorrected p<0.1. We chose this threshold because the perceptual category did not exceed the significance threshold in the univariate tests. (B) Spatial-decoders, consisting of linear models fit across all time samples for each source separately, summarize where each feature can be decoded. Lines indicate significant clusters of decoding scores across subjects (cluster-corrected p<0.05). (C) Temporal-decoders, consisting of linear models fit across all MEG channels, for each time sample separately, summarize when each feature can be decoded. To highlight the sequential generation of each representation, decoding scores are normalized by their respective peaks. Additional non-normalized decoding timecourses are available in Figure 2—figure supplements 1 and 2. (D) The peak and the start of temporal decoding plotted for each subject (dot) and for each feature (color). (E) The peak spatial decoding plotted for each subject (dot) and for each feature (color).

The perceived category did not yield significant results after correction for multiple comparisons. Thus, the rest of our analyses are based on multivariate techniques (univariate in feature space, multivariate in source space), which provide substantially greater statistical sensitivity to our experimental manipulations. The direct comparison between the sensitivity of the mass-univariate and multivariate approaches is shown in Figure 2.

Temporal and spatial decoding

To overcome the poor SNR of single-trial MEG responses, we next applied multivariate decoding analyses to identify when and where low-level visual features are represented in brain activity. We estimated, at each time sample separately, the ability of an l2-regularized regression to predict, from all MEG sensors, the five stimulus features of interest. Overall, the multivariate analyses were far more sensitive than the univariate tests.
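A temporal decoder of this kind can be sketched as below: one regularized linear model per time sample, scored by cross-validated AUC. This is an illustration on simulated sensor data, not the pipeline used in the study. To stay self-contained it uses a closed-form ridge regression on binary labels instead of a logistic regression, and the regularization strength, fold count and function names are assumptions.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form l2-regularized least squares; returns a prediction function."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    return lambda X_new: (X_new - x_mean) @ w + y_mean


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum statistic (assumes no ties)."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def temporal_decoding(data, labels, n_folds=5, alpha=1.0, seed=0):
    """data: (n_trials, n_sensors, n_times); labels: 0/1 per trial.
    Returns one cross-validated AUC per time sample."""
    n_trials, _, n_times = data.shape
    order = np.random.default_rng(seed).permutation(n_trials)
    folds = np.array_split(order, n_folds)
    scores = np.zeros(n_times)
    for t in range(n_times):
        decision = np.empty(n_trials)
        for test_idx in folds:
            train_idx = np.setdiff1d(order, test_idx)
            predict = ridge_fit(data[train_idx, :, t], labels[train_idx], alpha)
            decision[test_idx] = predict(data[test_idx, :, t])
        scores[t] = auc(decision, labels)
    return scores


# Simulated recording: a class-specific sensor pattern is present
# only between samples 10 and 19.
rng = np.random.default_rng(0)
n_trials, n_sensors, n_times = 200, 20, 30
labels = np.arange(n_trials) % 2
data = rng.normal(size=(n_trials, n_sensors, n_times))
pattern = rng.normal(size=n_sensors)
data[labels == 1, :, 10:20] += pattern[:, None]

auc_over_time = temporal_decoding(data, labels)
print(np.round(auc_over_time, 2))
```

Fitting a separate decoder at each sample makes no assumption that the sensor pattern carrying a feature stays the same over time, which is what later allows the temporal generalization analysis to test that assumption explicitly.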

Stimulus position (the location of the stimulus on the computer screen: left versus right) was decodable between 41 and 1500 ms and peaked at 120 ms (AUC = 0.94; SEM = 0.007; p<0.001 as estimated with second-level non-parametric temporal cluster test across subjects, Figure 2C). To summarize where stimulus position was represented in the brain, we implemented ‘spatial decoders’: l2-regularized logistic regressions fit across all time samples (0–1500 ms) for each estimated brain source separately. Spatial decoding peaked in early visual areas and was significant across a large variety of visual and associative cortices as estimated with a second-level non-parametric spatial cluster test across subjects (Figure 2B). Stimulus position was encoded in the timecourse of all sources, in both the left hemisphere (mean t-value = 9.42, p=0.0005) and the right (mean t-value = 9.97, p=0.0005). These signals peaked in the early visual cortex (mean MNI [x = 27.59; y = −74.15; z = −1.07]), and propagated along the ventral and dorsal streams during the first 200 ms (Figure 2A, Video 2); confirming the retinotopic organization of the visual hierarchy (Hagler and Sereno, 2006; Wandell et al., 2007).

Video 2
Temporal decoding results.

For each regressor of interest, the trajectory of normalized decoding accuracy is plotted over time. The beta coefficients from the univariate spatio-temporal analysis are plotted on the inflated brains, averaged over subjects. The timing of the beta coefficients corresponds to the timing of the normalized decoding accuracy, as shown in the ms counter at the bottom.

Second, we aimed to isolate more abstract representations related to stimulus identity. Stimulus identity can be analyzed either from an objective referential (i.e. what stimulus is objectively presented?) or from a subjective referential (i.e. what stimulus did subjects report having seen?). We first focus on decoding features of the stimulus that are not ambiguous, such that subjective and objective representations are confounded. To this aim, we grouped stimuli along common continua (e.g. the eight stimuli along the 4-H continuum belong to the same morph and are here considered to share a common identity) and fit logistic regression classifiers across morphs (i.e. E-6 versus 4-H). The corresponding stimulus identity was decodable between 120 and 845 ms and peaked at 225 ms (AUC = 0.59; SEM = 0.01; p<0.001). Spatial decoding revealed decodability from all sources (mean t-value = 7.9, p<0.0001). These effects peaked more anteriorly than those of stimulus position (mean MNI: x = 27.75; y = −62.75; z = −1.55; p<0.001).

Third, we aimed to isolate the neural signatures of subjective perceptual categorization and thus focus on decoding ambiguous pixels. To this aim, we grouped stimuli based on whether the subject reported a digit or a letter category. Temporal decoders weakly but significantly classified perceptual category from 150 to 940 ms after stimulus onset and peaked at 370 ms (AUC = 0.55; SEM = 0.01; p<0.001, Figure 2C). The corresponding sources also peaked in the inferotemporal cortex but more anteriorly than stimulus identity (x = 30.89; y = −35.64; z = 21.41; p<0.01). As reported above, the mass-univariate effects did not survive correction for multiple comparisons (e.g. 210–320 ms: mean t-value=1.79, p=0.21). Nonetheless, spatial decoders, which mitigate the trade-off between temporal specificity and the necessity to correct statistical estimates for multiple comparisons, showed that perceptual category was reliably decoded from a large set of brain areas (mean t-value=4.82; p<0.001; 594 significant vertices) (Figure 2B).

Importantly, when training the classifier on all active trials to distinguish letters (E/H) and digits (4/6), we could significantly (max AUC = 0.55; SEM = 0.011; p<0.01; 200–550 ms) decode this contrast for different unambiguous tokens (A/C versus 9/8) when trials were presented passively (no button press required). The time-course of responses to these passive trials was statistically indistinguishable until 350 ms. Thereafter, active trials led to significantly higher decoding accuracy than the generalization to passive trials. This suggests that the decoders specifically track the letter/digit representation, independently of pixel arrangements (see Figure 6—figure supplement 2 for passive decoding performance).

Fourth, trial uncertainty (i.e. the objective distance between the presented stimulus and the closest unambiguous character) could be decoded between 270 and 1485 ms and peaked at 590 ms (l2-regularized Ridge regression fit across sensors, R = 0.12; SEM = 0.024; p<0.01). Uncertainty signals were localized more anteriorly than those of stimulus category (x = 12.58; y = −91.44; z = −1.23; p<0.01). While spatial decoding led to significant clusters in the temporal, parietal and prefrontal areas (mean t-value = 2.91, p=0.002) (Figure 2B), the peak location of stimulus uncertainty was highly variable across subjects and included the dorso-parietal cortex, the temporo-parietal junction and the anterior cingulate cortex (Figure 2E).

Finally, temporal decoders of subjects’ button press (left versus right index fingers) were significant from 458 ms after stimulus onset and peaked at 604 ms (AUC = 0.85; SEM = 0.011; p<0.001). A significant cluster of motor signals could be detected around sensorimotor cortices between 590 and 840 ms in the univariate analysis (mean t-value=4.98, p<0.001, Figure 2A). Spatial decoding corroborated this result (mean t-value = 4.1, p<0.0001). Response-locked analyses revealed qualitatively similar but stronger results. For example, temporal decoders were significant from 350 ms prior to the response and up to 500 ms after the response, reaching an AUC of 0.94 at response time (p<0.001). Response-locked decoding performance is shown in Figure 2—figure supplement 2.

Overall, the time at which representations became maximally decodable correlated with their peak location along the postero-anterior axis (Figure 2D–E) (r = 0.57, p<0.001). The specific hierarchical organization of the stimulus features was in some cases surprising. For example, the letter/digit contrast peaked remarkably late (∼400 ms). Furthermore, uncertainty was one of the latest features to come online, and extended into responses to the subsequent trial. These results thus strengthen the classic notion that perceptual processes are hierarchically organized across space, time and function. Importantly, however, this cascade of representations spreads over more than 600 ms and largely exceeds the time it takes the feedforward response to ignite the ventral and dorsal pathways (Figure 1G and Video 1).

A hierarchy of recurrent layers explains the spatio-temporal dynamics of neural representations

The above results show that the brain sequentially generates, over an extended time period, a hierarchy of representations that ultimately account for perceptual reports.

To understand how this cascade of representations emanates from brain activity, we formalize four neural architectures compatible with the above results, and which nonetheless predict distinct spatio-temporal dynamics of each representation (Figure 3). In these models, we assume that each ‘layer’ generates new hierarchical features, in order to account for the organization of spatial decoders (Figure 2E). Furthermore, we only discuss architectures which can code for all representations simultaneously, in order to account for the overlapping temporal decoding scores (Figure 2C). Finally, we only model discrete activations (i.e. a representation is either encoded or not) as any more subtle variation can be trivially accounted for by signal-to-noise ratio considerations.

Source and temporal generalization predictions for various neural architectures.

(A) Four increasingly complex neural architectures compatible with the spatial and temporal decoders of Figure 2. For each model (rows), the five layers (L1, L2 … L5) generate new representations. The models differ in their ability to (i) propagate low-level representations across the hierarchy, (ii) maintain information within each layer in a stable or dynamic way. (B) Activations within each layer plotted at five distinct time samples. Dot slots indicate different neural assemblies within the same layer. Colors indicate which feature is linearly represented. For clarity purposes, only effective connections are plotted between different time samples. (C) Summary of the information represented within each layer across time. (D) Expected result of the temporal generalization analyses, based on the processing dynamics of each model.

Each model predicts (1) ‘source’ decoding time courses (i.e. what is decodable within each layer) and (2) ‘temporal generalization’ (TG) maps. TG is used to characterize the dynamics of neural representations and consists in assessing the extent to which a temporal decoder trained at a given time sample generalizes to other time samples (King and Dehaene, 2014b; Figure 3D).

Our spatial and temporal decoding results can be accounted for by a feedforward architecture that both (i) generates new representations at each layer and (ii) propagates low-level representations across layers (Figure 3 Model 1: ‘broadcast’). This architecture predicts that representations would not be maintained within brain areas. However, this lack of maintenance is not supported by additional analyses of our data. First, the position of the stimulus was decodable in the early visual cortex between 80 and 320 ms (mean t-value=5.18, p<0.001) and thus longer than the stimulus presentation. Second, most temporal decoders significantly generalized over several hundreds of milliseconds (Figure 4A–B). For example, the temporal decoder trained to predict stimulus position from t = 100 ms could accurately generalize until ~500 ms as assessed with spatio-temporal cluster tests across subjects (Figure 4A). Similarly, temporal decoders of perceptual category and button-press generalized, on average, for 287 ms (SEM = 12.47; p<0.001) and 689 ms (SEM = 30.94; p<0.001), respectively. Given that the neural activity underlying the decoded representations is partially stable over several hundreds of milliseconds, recurrent connections seem necessary to account for our data (Figure 3 Models 2–4).

Temporal generalization results.

(A) Temporal generalization for each of the five features orthogonally varying in our study. Colors indicate decoding score (white = chance). Contours indicate significant decoding clusters across subjects. (B) Cumulative temporal generalization scores for the temporal decoders trained at 100, 200, 300, 400 and 500 ms, respectively. These decoding scores are normalized by mean decoding peak for clarity purposes. (C) Same data as A but overlaid. For clarity purposes, contours highlight the 25th percentile of decoding performance.

Consequently, we then considered a simple hierarchy of recurrent layers, where recurrence only maintains activated units (Figure 3 Model 2: ‘maintain’). This architecture predicts strictly square TG matrices (i.e. temporal decoders would be equivalent to one another in terms of their performance) and is thus at odds with the largely diagonal TG matrices observed empirically (Figure 4A). Specifically, the duration of significant temporal decoding (fitting a new decoder at each time sample) was significantly longer than the generalization of a single decoder to subsequent time samples (e.g. 1239 versus 287 ms for perceptual category (t = −61.39; p<0.001) and 1215 versus 689 ms for button-press (t = −16.26; p<0.001), Figure 4B). These results thus suggest that the decoded representations depend on dynamically changing activity: that is, each feature is linearly coded by partially distinct brain activity patterns at different time samples.
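The contrast between square and diagonal TG matrices can be made concrete with a small simulation. The sketch below is illustrative only and rests on several assumptions: the data are synthetic, a ridge regression stands in for the decoders, a single train/test split replaces cross-validation, and the names `temporal_generalization` and `simulate` are hypothetical. A ‘stable’ code reuses one sensor pattern at every time sample, whereas a ‘dynamic’ code uses a different, orthogonal pattern at each sample.

```python
import numpy as np


def ridge_weights(X, y, alpha=1.0):
    """Closed-form l2-regularized least squares on centred data."""
    Xc = X - X.mean(axis=0)
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, Xc.T @ (y - y.mean()))


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum statistic (assumes no ties)."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def temporal_generalization(train, y_train, test, y_test, alpha=1.0):
    """Rows: training time; columns: testing time; values: AUC on held-out trials."""
    n_times = train.shape[2]
    tg = np.zeros((n_times, n_times))
    for t_train in range(n_times):
        w = ridge_weights(train[:, :, t_train], y_train, alpha)
        projected = np.einsum("ksj,s->kj", test, w)  # (n_test_trials, n_times)
        for t_test in range(n_times):
            tg[t_train, t_test] = auc(projected[:, t_test], y_test)
    return tg


def simulate(patterns, n_trials=400, seed=0):
    """patterns: (n_sensors, n_times) class difference added to label-1 trials."""
    rng = np.random.default_rng(seed)
    labels = np.arange(n_trials) % 2
    data = rng.normal(size=(n_trials,) + patterns.shape)
    data[labels == 1] += patterns
    half = n_trials // 2
    return data[:half], labels[:half], data[half:], labels[half:]


n_sensors, n_times = 30, 20
basis, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(n_sensors, n_sensors)))
stable = 3.0 * np.repeat(basis[:, :1], n_times, axis=1)   # one pattern, maintained
dynamic = 3.0 * basis[:, :n_times]                        # new orthogonal pattern per sample

tg_stable = temporal_generalization(*simulate(stable))
tg_dynamic = temporal_generalization(*simulate(dynamic))

off_diagonal = ~np.eye(n_times, dtype=bool)
print("stable  off-diagonal AUC:", round(float(tg_stable[off_diagonal].mean()), 2))
print("dynamic diagonal AUC:    ", round(float(np.diag(tg_dynamic).mean()), 2))
print("dynamic off-diagonal AUC:", round(float(tg_dynamic[off_diagonal].mean()), 2))
```

In both simulations a decoder refit at every sample performs well, so the diagonal alone cannot tell the two codes apart. Only the off-diagonal cells do: they stay high for the maintained pattern (square matrix) and fall to chance for the changing one (diagonal matrix).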

It is difficult to determine, with MEG alone, whether such dynamic maintenance results from a change of neural activity within or across brain areas. Indeed, Model 1 and Model 3 can equally predict diagonal TG (Figure 3). However, these two models, and their combination (Model 4), diverge in terms of where information should be decodable. Specifically, source analyses revealed that both stimulus position and perceptual category can be decoded across a wide variety of partially overlapping brain areas (Figure 2B, Video 2), similarly to Model 4. Nonetheless, our MEG study remains limited in assessing whether within-region dynamics also contribute to the diagonal TG, which would suggest a mixture between models 3 and 4.

Together, source and TG analyses thus suggest that the slow and sequential generation of increasingly abstract representations depends on a hierarchy of recurrent layers that generate, maintain and broadcast representations across the cortex.

Hierarchical recurrence induces an accumulation of delays

Can a hierarchy of recurrent processes account for the variation in single-trial dynamics? To address this issue, we hypothesized that recurrent processes would take variable amounts of time to converge to each intermediary representation. In this view, (i) each feature is predicted to propagate across brain areas at distinct moments, and (ii) the successive rise of decodable representations is thus predicted to incrementally correlate with reaction times (Figure 5A–E).

Figure 5 with 1 supplement
Correlation between TG peaks and reaction times.

(A, B) Recurrent processing at a given processing stage is hypothesized to take a variable amount of time to generate adequate representations. (C) According to this hypothesis, the rise of the corresponding and subsequent representations would correlate with reaction times. (D, left) Predictions when delays are only induced by the perceptual stage of processing. (D, middle) Predictions when delays are only induced by the motor processing stage. (D, right) Predictions when delays are induced by all processing stages. (E) TG scores aligned to training time, split into trials within the fastest and slowest reaction-time quantile and averaged across reaction-time bins. Dark and light lines indicate the average decoding performance for trials with fastest and slowest reaction times respectively. (F) Mean peak decoding time (y-axis) for each subject (dot) as a function of reaction time (x-axis), color-coded from dark (fastest) to light (slowest). The beta coefficients indicate the average delay estimate. (G) The average slope between processing delay and reaction time for each feature. Error-bars indicate the SEM.

To test this hypothesis, we estimated how the peak of each temporal decoder varied with reaction times. For clarity purposes, we split reaction times into four quantiles, and averaged the time courses of temporal decoders relative to their training time. This method allowed us to assess the extent to which neural processes are ‘sped up’ or ‘slowed down’ relative to the average processing speed, as represented by the diagonal axis (Figure 5—figure supplement 1 summarizes this method). These analyses showed that the latencies of (i) perceptual category (r = 0.35; p=0.006), (ii) stimulus uncertainty (r = 0.37; p=0.004) and (iii) button press (r = 0.66; p<0.001) increasingly correlated with reaction times (Figure 5F–G).
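The logic of relating decoding latency to reaction time can be sketched with a simplified simulation. This is not the TG-aligned procedure summarized in Figure 5—figure supplement 1: here, single-trial decoder outputs are simulated directly as a noisy bump whose latency shifts with reaction time, and the bump width, noise level, assumed slope of 0.5 and the helper name `peak_latency_by_rt_quantile` are all illustrative choices.

```python
import numpy as np


def peak_latency_by_rt_quantile(scores, times, rts, n_quantiles=4):
    """scores: (n_trials, n_times) single-trial decoder outputs.
    Returns the mean reaction time and the peak latency of each RT quantile."""
    edges = np.quantile(rts, np.linspace(0.0, 1.0, n_quantiles + 1))
    bins = np.clip(np.searchsorted(edges, rts, side="right") - 1, 0, n_quantiles - 1)
    mean_rt = np.array([rts[bins == q].mean() for q in range(n_quantiles)])
    peaks = np.array([times[scores[bins == q].mean(axis=0).argmax()]
                      for q in range(n_quantiles)])
    return mean_rt, peaks


rng = np.random.default_rng(0)
n_trials = 4000
times = np.linspace(0.0, 1.0, 101)        # seconds after stimulus onset
rts = rng.uniform(0.5, 0.9, n_trials)     # simulated reaction times (s)
true_slope = 0.5                          # assumed: latency moves by half of the RT change
centers = 0.2 + true_slope * rts
scores = np.exp(-0.5 * ((times[None, :] - centers[:, None]) / 0.05) ** 2)
scores += rng.normal(0.0, 0.1, scores.shape)

mean_rt, peaks = peak_latency_by_rt_quantile(scores, times, rts)
slope, intercept = np.polyfit(mean_rt, peaks, 1)
r = np.corrcoef(mean_rt, peaks)[0, 1]
print(f"estimated slope = {slope:.2f}, r = {r:.2f}")
```

A slope near zero would indicate that a representation arises at a fixed latency regardless of how fast the subject responds, whereas a slope near one would indicate that its timing fully tracks the response. Intermediate slopes that grow from one processing stage to the next are what an accumulation of delays predicts.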

Overall, these results show that we can track, with MEG, a series of decisions generated by hierarchical recurrent processes. This neural architecture partially accounts for subjects’ variable and relatively slow reaction times.

Hierarchical recurrence implements a series of all-or-none decisions

An architecture based on successive decisions predicts a loss of ambiguous information akin to all-or-none categorization across successive processing stages (Figure 6A). To test this prediction, we quantified the extent to which the decoding of ‘percept category’ and of ‘motor action’ varied linearly or categorically with (i) categorical evidence and (ii) motor evidence respectively (i.e. the extent to which the stimulus (i) objectively looks like a letter or a digit and (ii) should have led to a left or right button press given its pixels). Note that due to the limitations of our experimental design, we can only assess the effect of stimulus evidence at these two levels of representation: stimulus evidence only varied as a function of decision and motor response but was orthogonal to stimulus position and stimulus identity.

Figure 6 with 2 supplements
Motor and perceptual decisions.

(A) Hypothesis space for when responses become categorical: during sensory, perceptual or motor processing. (B, top) Time course of decoding the perceptual decision. (B, bottom) Classifier predictions split into different levels of sensory evidence. (C) Averaging probabilities in different time-windows shows the linear-categorical shift in how information is represented. (D, top) Time course of decoding the motor decision. (D, bottom) Splitting classifier predictions into different levels of uncertainty. (E) Different windows of classifier predictions, showing the categorical responses throughout processing.

The probabilistic decoding predictions of percept category correlated linearly with sensory evidence between 210 and 530 ms (r = 0.38 ±0.03, temporal-cluster p<0.001). The spatial decoders fit from 200 to 400 ms clustered around the VWFA (mean t-value=4.6; p=0.02; 224 vertices). These results suggest that this region first represents the stimulus objectively (i.e. in its full ambiguity).

Between 400 and 810 ms, the predictions of ‘perceptual category’ decoders were better accounted for by sigmoidal (r = 0.77 ±0.03, p<0.001) than by linear trends (r = 0.7 ±0.03, p<0.001). This suggests that later responses track the categorical perception rather than the linearly varying input. Spatial decoding effects restricted to the 500–700 ms time window were more distributed (mean t-value=4.4; p=0.022; 110 vertices). Finally, ambiguous stimuli (steps 5 and 6 on the continuum) reached maximum decodability 205 ms later than unambiguous stimuli (steps 1 and 8) (p<0.001) (Figure 6B). The interaction between trend (linear or sigmoidal) and window latency was significant across subjects (r = 0.07; SEM = 0.01; p=0.002).

This progressive categorization of the letter/digit representations contrasts with the all-or-none pattern of motor signals. Specifically, the probabilistic predictions of button-press decoders varied categorically with response evidence from 440 to 1290 ms (sigmoid > linear cluster, mean t-value=3.17; p<0.001). There was also a more transient linear trend from 410 to 580 ms (mean t-value=3.69; p<0.001). This suggests that, unlike perceptual category, motor signals largely derive from categorical inputs.

Together, delay (Figure 5) and categorization (Figure 6) analyses thus show that perceptual representations slowly become categorical and are subsequently followed by all-or-none motor representations.

Discussion

While the role of feedforward processes is becoming increasingly understood, what recurrent processes represent and how they are orchestrated remains largely unknown. Here, we show with source-localized MEG that recurrent processes sequentially generate, over an extended time period, a hierarchy of representations that ultimately account for the timing and the content of perceptual reports.

The conclusions of the present study are limited by two main aspects. First, while our sensory-motor and letter/digit representations are largely consistent with previous findings (Cohen et al., 2000; Shum et al., 2013), MEG source reconstruction remains imperfect. Consequently, identifying (1) the role of subcortical areas and (2) the extent to which representations dynamically change within each brain area will necessitate invasive brain recordings.

Second, our study focuses on the brain responses to individual characters. This unusual task (King and Dehaene, 2014a) thus adds to the long list of arbitrary stimuli used to probe the neural bases of decisions. Indeed, perceptual decisions have been investigated through the manipulation of Gabor patches (e.g. Wyart et al., 2012), clouds of moving dots (Shadlen and Newsome, 2001) and even bathroom and kitchen images (Linsley and MacEvoy, 2014). Our results are consistent with these studies in that perceptual decisions are represented up to the fronto-parietal cortices. In particular, Freedman et al., 2002 parametrically manipulated images of cats and dogs through 3D morphing and also showed that the lateral prefrontal cortex reflects the category of the stimuli independently of their physical similarity. The benefit of our design choice is that it allowed us to (1) orthogonalize and parametrically manipulate five levels of representations and (2) track the interplay between these different levels of representations in both time and space. In the future, it will thus be critical to verify that these findings can be observed across a wide variety of stimuli (e.g. Kar et al., 2019; Kietzmann et al., 2019), and to further investigate whether decisional boundaries can be manipulated online by specific task demands (Freedman et al., 2002).

Overall, our results bridge three important lines of research on the neural and computational bases of visual processing.

First, core-object recognition research, generally based on ~100 ms-long image presentations, has repeatedly shown that the spiking responses of the inferotemporal cortex are better explained by recurrent models than by feedforward ones (Lamme and Roelfsema, 2000; Kar et al., 2019). In particular, Kar et al., 2019 have recently shown that images that are challenging to recognize lead to delayed content-specific spiking activity in the macaque’s infero-temporal cortex. Similar evidence for recurrent processes was recently found using MEG (Kietzmann et al., 2019). Our findings, based on simpler but highly controlled stimuli, are consistent with these results and further highlight that perceptual representations are not confined to the inferotemporal cortices, but also reach a large variety of parietal and prefrontal areas (Freedman and Miller, 2008).

The specific order in which perceptual representations were generated was not entirely predictable a priori. In particular, we were surprised to find that Uncertainty was one of the last variables to come online (∼300 ms) and extended into the processing of the subsequent trial. This result may relate to the fact that, here, Uncertainty is confounded with memory and task-engagement effects, rather than solely the processing of the stimulus property per se (Bate et al., 1998). However, this result starkly contrasts with recent work showing very early sensitivity to Uncertainty (∼50 ms) in an auditory syllable categorization task (Gwilliams et al., 2018), suggesting that the latency of this response may also depend on the sensory modality and familiarity with the visual or auditory object at study.

Second, the present study makes important contributions to the perceptual decision making literature (Gold and Shadlen, 2007; O'Connell et al., 2012). With some notable exceptions (e.g. Philiastides and Sajda, 2007), this line of research primarily aims to isolate motor and supra-modal decision signals in the presence of sustained visual inputs: that is, neural responses ramping toward a virtual decision threshold, independently of the representation on which this decision is based (O'Connell et al., 2012). The present study complements this approach by tracking the representation-specific signals that slowly emerge after a brief stimulus.

Our results thus open an exciting avenue for querying the gating mechanisms of successive decisions and clarifying the role of the prefrontal areas in the coordination of multiple perceptual and supramodal modules (Sarafyazd and Jazayeri, 2019).

Finally, our results constitute an important confirmation of modern theories of perception. In particular, the Global Neuronal Workspace Theory predicts that perceptual representations need to be broadcast to associative cortices via the fronto-parietal areas to lead to subjective reports (Dehaene and Changeux, 2011a). Yet, with some notable exceptions (Tong et al., 1998; King et al., 2016), previous studies often fail to dissociate perceptual contents and perceptual reports (e.g. Sergent et al., 2005; van Vugt et al., 2018). By contrast, the present experimental design allows an unprecedented dissection of the distinct processing stages that transform sensory input into perceptual representations and, ultimately, actions. The generation of letter and digit representations in the dedicated brain areas (Cohen et al., 2000; Shum et al., 2013) and their subsequent broadcast to the cortex reinforce the notion that subjective perception relates to the global sharing of content-specific representations across brain areas (Dehaene and Changeux, 2011a; Lamme, 2003).

Materials and methods

Target stimuli

Using the font designed in King and Dehaene, 2014a, the stimuli were made from 0, 4, 5, 6, 8, 9, A, C, E, H, O, S, or from a linear combination of two of these characters varying in a single black bar (hereafter ‘pixel’). The corresponding ‘morphs’ were created by adjusting the contrast of the remaining pixel along eight equally spaced steps between 0 (no bar) and 1 (black bar).

Experiment 1

Eight subjects with normal or corrected vision, seated ~60 cm from a 19″ CRT monitor (60 Hz refresh rate, resolution: 1024 × 768), performed a stimulus identification task with continuous judgements across 28 variably ambiguous stimuli generated from digit stimuli. Ten euros were provided in compensation for this 1 hr experiment.

Subjects performed four blocks of 50 trials, each organized in the following way. After a 200 ms fixation, a target stimulus, randomly selected from the 28 stimuli, was flashed for 83 ms on a 50% gray background to the left or to the right of fixation. The orientation of the reporting disk (e.g. 5-6-8-9 versus 5-9-8-6) was counterbalanced across subjects. Subjects then had up to 10 s to move a cursor on a large disk to report their percepts. The radius on the disk indicated subjective visibility (center = did not see the stimulus, disk border = max visibility). The angle on the disk indicated subjective identity (e.g. 5, 6, 8, 9 for the top left, top right, bottom right, and bottom left ‘corners’, respectively). The inter-trial interval was 500 ms. To verify that subjects provided meaningful reports, the target stimulus was absent on 15% of the trials. Absent trials were rated with a low visibility (defined as a radius below 5% of the disk radius) in most cases. Absent trials and trials reported with a low visibility were excluded from subsequent analyses (16% ± 1.4%). The report distributions plotted in Figure 1B were generated with Seaborn’s bivariate Gaussian kernel density estimate function with default parameters.

Modeling categorical reports

To test whether subjective reports of stimulus identity varied linearly or categorically with sensory evidence, we analyzed how reports’ angle (i.e. subjective identity) varied with the expected angle given the stimulus (i.e. sensory evidence).

For each morph (5–6, 5–8, 9–8 and 6–8) separately, we fit a linear model: 

(1) $\hat{y} = \beta_1 x + \beta_0$

and a sigmoidal model:

(2) $\hat{y} = \frac{1}{1 + \exp(\beta_1 x + \beta_2)} + \beta_0$

where $\hat{y}$ is the report angle predicted by the model, $x$ is the expected angle given the stimulus pixels and $\beta_0$ is a free bias parameter.

To minimize the effects of noise, behavioral reports were first averaged within each level of evidence, sorted from the stimulus with the fewest pixels (e.g. 5 in the 5–6 morph) to the stimulus with the most pixels (e.g. 6 in the 5–6 morph). The resulting averages were normalized to range between 0 and 1 within each subject. The β parameters were fit with Scipy’s ‘curve_fit’ function (Jones et al., 2001) to minimize a mean squared error across trials $i$:

(3) $\underset{\beta}{\arg\min} \sum_i (y_i - \hat{y}_i)^2$

Because the linear and sigmoidal models have distinct numbers of free parameters, we compared them within a five-split cross-validation. Specifically, the two models were repeatedly fit and tested on independent trials. A Pearson correlation coefficient $r$ summarized the ability of each model to accurately predict $y_{test}$ given $x_{train}$, $y_{train}$ and $x_{test}$. Finally, a Wilcoxon test was applied across subjects to test whether the two models were consistently above chance ($r > 0$) and consistently different from one another ($r_{sigmoid} > r_{linear}$).
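
The model comparison described above can be sketched as follows. This is a minimal illustration rather than the analysis code of the study: the data are synthetic, the helper name `cv_score` and the starting values passed to `curve_fit` are assumptions, the models are fit on single simulated trials instead of per-level averages, and the across-subject Wilcoxon test is omitted.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr


def linear(x, b1, b0):
    # Equation (1): a straight line through the evidence axis.
    return b1 * x + b0


def sigmoid(x, b1, b2, b0):
    # Equation (2): the sign of the slope is absorbed by the free parameter b1.
    return 1.0 / (1.0 + np.exp(b1 * x + b2)) + b0


def cv_score(model, p0, x, y, n_splits=5, seed=0):
    """Mean Pearson r between held-out reports and model predictions."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(x)), n_splits)
    scores = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(x)), test)
        params, _ = curve_fit(model, x[train], y[train], p0=p0, maxfev=10000)
        r, _ = pearsonr(model(x[test], *params), y[test])
        scores.append(r)
    return float(np.mean(scores))


# Synthetic reports (NOT the study's data): a steep categorical boundary at x = 0.5.
rng = np.random.RandomState(1)
x = rng.uniform(0, 1, 400)
y = 1.0 / (1.0 + np.exp(-20 * (x - 0.5))) + rng.normal(0, 0.05, x.size)

r_linear = cv_score(linear, [1.0, 0.0], x, y)
r_sigmoid = cv_score(sigmoid, [-10.0, 5.0, 0.0], x, y)
```

Because both models are scored on held-out trials, the extra parameter of the sigmoid does not by itself inflate its score; on these simulated categorical reports the sigmoidal fit should correlate better with the test data than the linear one.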

Experiment 2

This experiment was performed at Neurospin, Gif-sur-Yvette, thanks to the support of Stanislas Dehaene. Seventeen subjects performed a discrete identification task across 22 variably ambiguous stimuli generated from letters and digits inside an Elekta Neuromag MEG scanner (204 planar gradiometers and 102 magnetometers). Seventy euros were provided in compensation for the 1 hr experiment and 30 min of preparation.

A sample size of 17 participants was selected based on previous visual studies utilizing the same MEG machine (King et al., 2016).

Participants’ head shape was digitized along with five fiducial points on the forehead and on each aural canal. Five head-position coils were placed on each subject’s head and localized at the beginning of each block.

The trial structure was as follows. A black fixation cross was displayed on a 50% gray background for 300 ms followed by a 100 ms-long target stimulus presented on the left or on the right of fixation. Two task-irrelevant flankers (e.g. stimulus can be read as an S or a 5) were displayed on the side of this target stimulus to increase our chances of eliciting recurrent processing via crowding (Strasburger et al., 2011). Subjects were given two seconds to report the identity of the stimulus. Reports of stimulus identity were given by pressing a button with the left or the right index finger. The identity-button mapping changed on every block to orthogonalize the neural correlates of stimulus identity and the neural correlates of motor actions. For example, in block 1, perceiving an E or a 4 should have been reported with a left button press, whereas in block 2, E and 4 should have been reported with a right button press. Subjects were explicitly reminded of the identity-button mapping before each block. In addition, visual feedback was displayed after non-ambiguous trials. Specifically, the fixation turned green for 100 ms or red for 300 ms in correct and incorrect trials respectively. The brain responses to these feedback stimulations are not analyzed in the present study. The inter-trial interval was 1 s. Subjects were provided a short training to ensure they understood the task, and identified non-ambiguous targets at least 80% of the time.

A total of 1920 trials, grouped into 40 blocks, were performed by each subject, 320 of which were presented passively at the end of each block – subjects were not required to provide a response. The trial structure was generated by (i) permuting all combinations of stimulus features (e.g. position, identity, response mapping, uncertainty), and (ii) shuffling the order of presentation for each subject. The experiment was presented using Psychtoolbox (Kleiner et al., 2007).
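
The two-step procedure above (fully crossing the stimulus features, then shuffling the order per subject) can be sketched with the Python standard library. The feature levels listed here are hypothetical placeholders chosen for illustration; they are not the actual 22-stimulus design, and `make_trials` is an assumed helper name.

```python
import itertools
import random

# Hypothetical feature levels, for illustration only; the actual design
# used 22 stimuli and is not reproduced here.
features = {
    "position": ["left", "right"],
    "morph": ["E-6", "H-4"],
    "contrast_step": list(range(8)),
    "response_mapping": ["A", "B"],
}


def make_trials(features, subject_seed):
    """Fully cross all feature levels, then shuffle the order per subject."""
    names = list(features)
    trials = [dict(zip(names, combo))
              for combo in itertools.product(*features.values())]
    # A per-subject seed makes each subject's order reproducible.
    random.Random(subject_seed).shuffle(trials)
    return trials


trials = make_trials(features, subject_seed=1)
```

Crossing every level of every feature keeps the features balanced against one another, which is what allows their neural correlates to be estimated independently.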

All experiments were approved by the local ethics committee. All subjects signed an informed consent form.

Structural MRI

For each subject, an anatomical MRI with a resolution of 1×1×1.1 mm was acquired after the MEG experiment with a 3T Siemens scanner. Gray and white matter were segmented with the FreeSurfer ‘recon-all’ pipeline (Fischl, 2012) and coregistered with each subject’s digitized head shape along with fiducial points.

Preprocessing

The continuous MEG recording was noise-reduced using Maxfilter’s SSS correction on the raw data, bandpass-filtered between 0.5 and 40 Hz using MNE-Python’s default parameters with firwin design (Gramfort et al., 2014) and downsampled to 250 Hz. Epochs were then segmented between −300 ms and +1500 ms relative to stimulus onsets.

After coregistering the MEG sensor data with subjects’ structural MRI and the head position coils, we computed the forward model using a 1-layer (inner skull) boundary element model, for each subject separately and fit a minimum-norm inverse model (signal to noise ratio: 3, loose dipole fitting: 0.2, with normal orientation of the dipole relative to the cortical sheet) using the noise covariance across sensors averaged over the pre-stimulus baseline across trials. Finally, the inverse model was applied to single-trial data resulting in a dynamic Statistical Parameter Map (dSPM) (Dale et al., 2000) value for each source at each time sample.

Modeled features

We investigated whether single-trial source and sensor evoked responses varied as a function of five features: (1) the position of the stimulus on the computer screen (left versus right of fixation), (2) the morph from which the stimulus is generated (E-6 versus H-4), (3) the category of the stimulus (letter versus digit), (4) the uncertainty of the trial (maximum uncertainty = stimuli with pixel at 50% contrast; minimally uncertain stimuli with pixels at 0% or 100% contrast), and (5) the response button used to report the stimulus (left versus right button). By design, these five features are independent of one another.

It is challenging to dissociate brain responses that represent objective sensory information from those that represent perceptual decisions as the two are generally collinear. To address this issue, we first fit univariate and multivariate models to predict perceptual category: that is, whether the button press indicated a character that belongs to the digit or to the letter category. This feature is independent of the button press (e.g. the letter E and the digit 4 can be reported with the same button in a given block). Furthermore, this feature is not necessary to perform the task (i.e. knowing whether E and H are letters is unnecessary to discriminate them). We reasoned that if subjects automatically generate letter/digit representations during perceptual categorization, then we should be able to track the generation of this abstract feature from brain activity.

Mass univariate statistics

To estimate whether brain responses correlated with each of these five features, we first fit, within each subject, mass univariate analyses at each source location and for each time sample with a linear regression:

(4) $\beta = (X^\top X)^{-1} X^\top y$

where $X \in \mathbb{R}^{n \times f}$ is a design matrix of $n$ epochs by $f = 5$ features and $y \in \mathbb{R}^{n}$ is the univariate brain response at a given source and at a given time. The effect sizes $\beta$ were then passed to second-level statistics across subjects corrected for multiple comparisons using non-parametric spatio-temporal cluster testing (see below).
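
As an illustration of equation (4), the regression can be solved for all sources and time samples in one call. The sketch below uses NumPy on synthetic data; the function name `mass_univariate` and the array shapes are assumptions made for the example, not the study's code.

```python
import numpy as np


def mass_univariate(X, Y):
    """OLS effect sizes for every source and time sample at once.

    X : (n_epochs, n_features) design matrix
    Y : (n_epochs, n_sources, n_times) single-trial brain responses
    Returns beta with shape (n_features, n_sources, n_times).
    """
    n_epochs, n_sources, n_times = Y.shape
    Y2d = Y.reshape(n_epochs, -1)
    # lstsq solves the same least-squares problem as (X^T X)^{-1} X^T y,
    # but is numerically more stable than forming an explicit inverse.
    beta, *_ = np.linalg.lstsq(X, Y2d, rcond=None)
    return beta.reshape(X.shape[1], n_sources, n_times)


# Synthetic check (not the study's data): only feature 2 drives source 3.
rng = np.random.RandomState(0)
X = rng.choice([-1.0, 1.0], size=(200, 5))
Y = rng.normal(0, 1, size=(200, 4, 6))
Y[:, 3, :] += 2.0 * X[:, [2]]
beta = mass_univariate(X, Y)
```

Each column of the reshaped response matrix is an independent regression, so the resulting `beta[f, s, t]` is the effect size of feature `f` at source `s` and time `t`, ready to be stacked across subjects for second-level testing.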

Decoding

Decoding analyses consist in predicting each feature from multivariate brain responses. Decoding analyses were performed within a five-split stratified K-Fold cross-validation using l2-regularized linear models. Classifiers consisted of logistic regressions (with scikit-learn’s default parameters, C=1; Pedregosa, 2011):

(5) $\underset{\beta}{\arg\min} \sum_i \log\left(1 + \exp(-y_i \beta^\top x_i)\right) + C \lVert \beta \rVert^2$

where $y_i \in \{\pm 1\}$ is the feature to be decoded at trial $i$ and $x_i$ is the multivariate brain response.

Regressors consisted of ridge regressions (with scikit-learn’s default parameters, α=1; Pedregosa, 2011):

(6) $\underset{\beta}{\arg\min} \sum_i (y_i - \beta^\top x_i)^2 + \alpha \lVert \beta \rVert^2$

For each subject independently, decoding performance was summarized across trials, with an area under the curve (AUC) and a Spearman r correlation score for classifiers and regressors, respectively.

All decoders were provided with data normalized by the mean and the standard deviation in the training set.
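
A classifier of this kind can be sketched with scikit-learn as follows. The data are synthetic and the helper name `decode_auc` is an assumption; the sketch only illustrates the combination of training-set normalization, stratified five-fold cross-validation, logistic regression with C=1 and an AUC computed across trials.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def decode_auc(X, y, n_splits=5, seed=0):
    """Cross-validated AUC of an l2-regularized logistic regression."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    y_prob = np.zeros(len(y))
    for train, test in cv.split(X, y):
        # The scaler is fit on the training fold only, then applied to the test fold.
        clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
        clf.fit(X[train], y[train])
        y_prob[test] = clf.predict_proba(X[test])[:, 1]
    # Out-of-fold probabilities are pooled and scored across all trials.
    return roc_auc_score(y, y_prob)


# Synthetic data (not the study's): 3 of 20 "sensors" carry the feature.
rng = np.random.RandomState(0)
y = rng.choice([-1, 1], size=300)
X = rng.normal(size=(300, 20))
X[:, :3] += 0.5 * y[:, None]

auc = decode_auc(X, y)
auc_null = decode_auc(X, rng.permutation(y))
```

Wrapping the scaler and the classifier in one pipeline guarantees that the mean and standard deviation never leak from the test fold, which matters when effect sizes are as small as those reported here.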

Spatial decoding consists in fitting a series of decoders at each brain source independently, across all 1500 time samples relative to stimulus onset. This analysis results in a decoding brain map that indicates where a feature can be linearly decoded in the brain. These decoding maps were then passed to cluster-corrected second-level statistics across subjects.

Temporal decoding consists in fitting a series of decoders at each time sample independently, across all 306 MEG sensors. This analysis results in a decoding time course that indicates when a feature can be linearly decoded from MEG signals. These decoding time courses were then passed to cluster-corrected second-level statistics across subjects.

Temporal generalization (TG) consists in testing whether a temporal decoder fit on a training set at time t can decode a testing set at time t′ (King and Dehaene, 2014b). TG can be summarized with a square training time × testing time decoding matrix. To quantify the stability of neural representations, we measured the duration of above-chance generalization of each temporal decoder. To quantify the dynamics of neural representations, we compared the mean duration of above-chance generalization across temporal decoders to the duration of above-chance temporal decoding (i.e. the diagonal of the matrix versus its rows). These two metrics were assessed within each subject and tested with second-level statistics across subjects.
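
The construction of a TG matrix can be illustrated with a small NumPy sketch. Several simplifications are assumed and should not be read as the study's pipeline: a closed-form ridge decoder stands in for the scikit-learn estimators, a single train/test split replaces the five-fold cross-validation, and scores are Pearson correlations rather than AUC or Spearman values. The simulated data contain two assemblies that carry the same feature at successive times.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge weights: (X^T X + alpha I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


def temporal_generalization(X_train, y_train, X_test, y_test, alpha=1.0):
    """Score every (training time, testing time) pair.

    X_* : (n_trials, n_sensors, n_times); y_* : (n_trials,)
    Returns an (n_times, n_times) matrix of Pearson correlations between
    decoder predictions and the held-out feature.
    """
    n_times = X_train.shape[2]
    scores = np.zeros((n_times, n_times))
    for t in range(n_times):
        # Normalization statistics come from the training set at time t.
        mu, sd = X_train[:, :, t].mean(0), X_train[:, :, t].std(0)
        w = ridge_fit((X_train[:, :, t] - mu) / sd, y_train - y_train.mean(), alpha)
        for t2 in range(n_times):
            pred = ((X_test[:, :, t2] - mu) / sd) @ w
            scores[t, t2] = np.corrcoef(pred, y_test)[0, 1]
    return scores


# Synthetic data: sensor 0 carries the feature early, sensor 1 carries it later.
rng = np.random.RandomState(0)
n_trials, n_sensors, n_times = 1000, 10, 12
y = rng.choice([-1.0, 1.0], size=n_trials)
X = rng.normal(size=(n_trials, n_sensors, n_times))
X[:, 0, 2:6] += y[:, None]
X[:, 1, 6:10] += y[:, None]

tg = temporal_generalization(X[:500], y[:500], X[500:], y[500:])
```

In this toy example a decoder trained during the first assembly generalizes within that assembly's time window but not to the second one, which produces the block-like structure expected when distinct neural populations maintain a representation in succession.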

Permutation cluster test

To evaluate the statistical significance of the univariate and multivariate analyses, we used a one-sample permutation cluster test as implemented in MNE-Python (Gramfort et al., 2014). We used the default parameters of the ‘spatio_temporal_cluster_1samp_test’ function from MNE version 0.17.1.

First, we center the data around the theoretical chance level (e.g. 0.5 for AUC, 0 for Spearman correlation or beta coefficient). A one-sample t-test is performed at each location in time and space. Then, spatio-temporally adjacent data-points are clustered based on a cluster-forming threshold of p<0.05. The test statistic for each cluster is the sum of the t-values across time and space. Randomized data are generated with random sign flips, and a new set of clusters is formed. The null distribution is created based on the summed t-values that are generated from 5000 random permutations of the data. This analysis follows Maris and Oostenveld, 2007.
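
The logic of this test can be sketched in a few lines of NumPy for the one-dimensional (time only) case. This is a simplified illustration and not MNE-Python's implementation: it is one-tailed, clusters are defined along time only, the function names are assumptions, and the cluster-forming t threshold is passed in by the caller (1.746 corresponds to a one-tailed p<0.05 with 16 degrees of freedom).

```python
import numpy as np


def _clusters(mask):
    """Return (start, stop) index pairs of consecutive True runs in a 1-D mask."""
    padded = np.concatenate([[False], mask, [False]])
    edges = np.flatnonzero(np.diff(padded.astype(int)))
    return list(zip(edges[::2], edges[1::2]))


def _tvalues(data):
    """One-sample t-value against zero at each time sample."""
    n = data.shape[0]
    return data.mean(0) / (data.std(0, ddof=1) / np.sqrt(n))


def cluster_test_1samp(data, t_threshold, n_permutations=1000, seed=0):
    """One-tailed, one-sample sign-flip cluster test along time.

    data : (n_subjects, n_times), already centered on the chance level.
    Returns a list of ((start, stop), summed_t, p_value), one per cluster.
    """
    rng = np.random.RandomState(seed)

    def cluster_stats(x):
        t = _tvalues(x)
        return [(c, t[c[0]:c[1]].sum()) for c in _clusters(t > t_threshold)]

    observed = cluster_stats(data)
    null = np.zeros(n_permutations)
    for i in range(n_permutations):
        # Flip the sign of each subject's whole time course at random.
        signs = rng.choice([-1.0, 1.0], size=(data.shape[0], 1))
        stats = [s for _, s in cluster_stats(data * signs)]
        null[i] = max(stats) if stats else 0.0
    return [(c, s, (np.sum(null >= s) + 1) / (n_permutations + 1))
            for c, s in observed]


# Synthetic scores for 17 subjects with a true effect between samples 20 and 30.
rng = np.random.RandomState(0)
scores = rng.normal(0.0, 1.0, size=(17, 50))
scores[:, 20:30] += 1.5
results = cluster_test_1samp(scores, t_threshold=1.746)
```

Comparing each observed cluster to the distribution of the largest cluster found under sign flips is what controls for multiple comparisons across time samples.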

Linear versus categorical

To test whether neural representations varied as a function of (i) reaction times (RTs, split into four quantiles), (ii) sensory evidence (i.e. the extent to which the stimulus objectively corresponds to a letter) and (iii) motor evidence (i.e. whether the stimulus should have led to the left button press), we analyzed the extent to which decoders’ predictions covaried with each of these three variables $z$:

(7) $f(z, \beta^\top X)$

where $f$ is a linear or a sigmoidal model, $X$ is the multivariate brain response and $\beta$ is the decoder’s coefficient fit with cross-validation.

Statistics

Univariate, decoding and TG models were fit within subjects, and tested across subjects. In case of repeated estimates (e.g. temporal decoding is repeated at each time sample), statistics were derived from non-parametric cluster-testing with 10,000 permutations across subjects with MNE-Python’s default parameters (Gramfort et al., 2014).

Simulations

To formalize how distinct neural architectures lead to distinct spatio-temporal dynamics, we modeled discrete linear dynamical systems forced with a transient input U. Specifically:

(8) $X_{t+1} = A X_t + B U_t$

where $X$ is a multidimensional time series (i.e. neurons × time), $A$ is the architecture and corresponds to a square connectivity matrix (i.e. neurons × neurons), $B$ is an input connectivity matrix (i.e. inputs × neurons), and $U$ is the input vector.

Distinct architectures differ in the way units are connected with one another. For simplicity, we order units in the A matrix such that their row indices correspond to their hierarchical levels.

In this view, the recurrent, feedforward and skip connections of the architecture A were modeled as a binary diagonal matrix R, a shift matrix F and a matrix S with entries of one in the last column, respectively. These three matrices were modulated by specific weights, as detailed below. The input U was only connected to the first ‘processing stage’, that is, to the first unit(s) of A, via a matrix B constant across architectures, and consisted of a transient square-wave input that mimics the transient flash of the stimulus onto subjects’ retina.

To model multiple features, we adopted the same procedure with multiple units per layer. Each unit within each layer was forced to encode a specific feature.

Each architecture shown in Figure 3 was fed an input at t = 1, and simulated for eight time steps. Finally, temporal generalization analyses based on the architectures’ activations were applied for each of the features.

The same architecture is shown in Figure 5B. Here we simulate for a total of 108 time-steps, with an arbitrary delay of 100 time-steps between t = 6 and t = 106.
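
A minimal NumPy sketch of equation (8) is given below. It covers only the recurrent (R) and feedforward (F) components; skip connections are left out, and the weights, the number of units and the function names are illustrative assumptions rather than the values used for the figures. The sketch also assumes that A[i, j] is the weight from unit j to unit i and that B has shape (units, inputs) so that the product B U is defined.

```python
import numpy as np


def build_architecture(n_units, recurrent=0.0, feedforward=0.0):
    """Weighted sum of recurrent (R) and feedforward (F) connections.

    Convention assumed here: A[i, j] is the weight from unit j to unit i,
    so a sub-diagonal entry passes activity one level up the hierarchy.
    """
    R = np.eye(n_units)          # self-connections (binary diagonal matrix)
    F = np.eye(n_units, k=-1)    # shift matrix: unit i feeds unit i + 1
    return recurrent * R + feedforward * F


def simulate(A, B, U):
    """Iterate X[t + 1] = A @ X[t] + B @ U[t] from X[0] = 0.

    A : (n_units, n_units), B : (n_units, n_inputs), U : (n_inputs, n_times)
    Returns X with shape (n_units, n_times + 1).
    """
    n_units, n_times = A.shape[0], U.shape[1]
    X = np.zeros((n_units, n_times + 1))
    for t in range(n_times):
        X[:, t + 1] = A @ X[:, t] + B @ U[:, t]
    return X


n_units, n_times = 4, 8
B = np.zeros((n_units, 1))
B[0, 0] = 1.0                    # the input reaches the first unit only
U = np.zeros((1, n_times))
U[0, 0] = 1.0                    # transient input at the first time step

X_ff = simulate(build_architecture(n_units, feedforward=1.0), B, U)
X_rec = simulate(build_architecture(n_units, recurrent=0.5, feedforward=1.0), B, U)
```

In the purely feedforward case each unit is active for a single time step as the pulse travels up the hierarchy, whereas adding self-connections lets each unit maintain a decaying trace of the input while passing it on. This is the qualitative difference that the temporal generalization analyses are designed to detect.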

References

  17. Jones E, Oliphant T, Peterson P (2001) SciPy: Open Source Scientific Tools for Python, version 3.0.4.
  32. Pedregosa F (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
  39. Schrimpf M, Kubilius J, Hong H, Majaj NJ, Rajalingham R, Ziemba C, Issa EB, Kar K, Bashivan P, Prescott-Roy J, Schmidt K, Nayebi A, Bear D, Yamins DLK, DiCarlo JJ (2019) Using brain-score to evaluate and build neural networks for brain-like object recognition. Cosyne 19.

Decision letter

  1. Thomas Serre
    Reviewing Editor; Brown University, United States
  2. Michael J Frank
    Senior Editor; Brown University, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Thank you for submitting your article "Recurrent processes support a cascade of hierarchical decisions" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by Thomas Serre as Reviewing Editor and Michael Frank as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

As the editors have judged that your manuscript is of interest, but as described below that additional experiments are required before it is published, we would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is "in revision at eLife". Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)

Summary:

The authors use a combination of MEG, structural MRI, and computational modeling to measure how the visual cortex accumulates information for discriminating between objects. Using a digit/letter warping dataset, the authors identify difficult exemplars, devise clever analyses to show when representations become categorical, and combine temporal decoding analyses with modeling to describe dynamics and infer computations. This paper has a significant number of results, elegantly presented in beautiful figures. While some (if not most) of the conclusions derived from the work may have been reached independently by prior studies, one major strength of the present manuscript is to examine all these questions within a single dataset. Further qualities are the use of an elegant experimental design, thorough decoding analysis methods, and adequate use of modeling to help disentangle alternative explanations.

However, as detailed below, the reviewers have identified a number of weaknesses and are requesting that the authors comment on these critiques. One of the main issues raised by the reviewers has to do with some of the effect sizes and underlying statistical tests.

Essential revisions:

Statistics

The reviewers struggled a bit with the effect sizes reported. The authors normalize their scores (Figure 2C, Figure 4), which makes the effects look strikingly similar. But the truth is that some of the effect sizes are so small (AUCs between 0.5-0.6) that it can be hard to accept some of the findings. The reviewers would like to ask the authors to think of additional analyses that they could run that would ease these concerns.

While the maximum values of the curves in this figure are very different in range, the variations in baselines of blue, green, and orange curves do not show the scaling. Please provide figures without the scaling.

Is the trial uncertainty decoding time course significant? The effect size is very small compared to other features and it is very distributed over the brain which makes us wonder if this feature is actually readable from the brain activity.

There are no statistical tests reported in Figure 1G. Please mark significant decoding scores over time by drawing a contour line around the significant clusters. Please describe the statistical tests in Figure 2A and B as the multiple comparison corrections are unclear.

Figure 2A is thresholded based on t-values that exceed an uncorrected p <.1. The reviewers are hoping this is a typo.

For the curves in Figure 2C, the authors do not indicate the time points when the scores are above chance. For example, we do not know if the blue curve with a max of 0.08 is even significant.

In addition to multiple comparison corrections across time, the authors should correct for multiple comparisons across five features.

The authors have not reported the thresholds they use for cluster definition and cluster size corrections. Please comment. This is especially important because it is not clear if the authors have also corrected for 5 multiple comparisons across the five features.

Subsection “Hierarchical recurrence implements a series of all-or-none decisions”: Between 400 and 810 ms, the predictions of 'perceptual category' decoders were better accounted for by sigmoidal (r=0.77 +/-0.03, p<0.001) than by linear trends (r=0.77 +/-0.03, p<0.001)? Please comment.

Subsection “Hierarchical recurrence induces an accumulation of delays”: the authors test the correlation of peak latency of averaged temporal decodings when averaged over training times. Please do this analysis with the temporal decoding time courses in Figure 2C. Because the main temporal dynamics occur along the diagonal of the TG decoding matrix.

Modeling

Another issue is with the modeling simulations described in the subsection “Statistics”, which disambiguates between the hypotheses in Figure 3. The reviewers' (maybe incorrect) interpretation was that the authors tested whether or not these stimuli are being processed via hierarchical recurrent computations or not. The reviewers thought this was a strawman argument, as there is no reason to suspect the converse (non-hierarchical/non-recurrent). This modeling work thus only added to their overall feeling that the contributions of the present manuscript were actually quite limited.

Interpretation

We would suggest adding clear statements in the Abstract and in the Discussion cautioning that these exact results may be limited to this specific task (difficult digit vs. letter classification), and could differ for other tasks (e.g. simple detection task, natural scene or object categorization).

Contributions

There must be a discussion of (Freedman et al., 2002). Those authors parametrically warped dog/cat stimuli to show that a region of the prefrontal cortex (PFC) reflected stimulus discriminability. This is of course closely related to the present work, where the authors use digit/letter stimuli to accomplish the same thing and focus on the visual cortex rather than PFC. The reviewers request some clarification about the contributions of the present study in light of this work and especially discuss the possibility that most of the presented results could be reflecting common input from PFC as suggested by Linsley and MacEvoy, 2014.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your article "Recurrent processes support a cascade of hierarchical decisions" for consideration by eLife. Your revised article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Michael Frank as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, we are asking editors to accept without delay manuscripts, like yours, that they judge can stand as eLife papers without additional data, even if they feel that they would make the manuscript stronger. Thus the revisions requested below only address clarity and presentation.

The reviewers made a few additional comments.

1) The most important one deals with the lack of significance in your mass-univariate analyses. We suggest you describe the analyses in the main text and explain that they did not reach significance. Because the results make sense, the reviewers suggest keeping them but moving them to the SI, as they should be taken with a grain of salt (and stating that).

2) Related to comment 2 on statistics, the plots in Figure 2—figure supplements 1 and 2 are again scaled, because the y-axis of each plot is scaled to that plot's maximum.

https://doi.org/10.7554/eLife.56603.sa1

Author response

Essential revisions:

Statistics

The reviewers struggled a bit with the effect sizes reported. The authors normalize their scores (Figure 2C, Figure 4), which makes the effects look strikingly similar. But the truth is that some of the effect sizes are so small (AUCs between 0.5-0.6) that it can be hard to accept some of the findings… The reviewers would like to ask the authors to think of additional analyses that they could run that would ease these concerns.

Thank you for highlighting this issue. It is true that some of the effect sizes are small. However, the vast majority of these effects are highly reliable and consistent across subjects. In addition, the motivation behind the normalization of the decoding scores is to highlight their temporal differences. We believe that the successive development of the five representations is more valuable information than the fact that the visual position of the stimulus is much easier to decode than its letter/digit category.

To strengthen confidence in the robustness of our results we have now added:

i) Several supplementary figures, including non-scaled decoding time-courses (Figure 2—figure supplements 1 and 2), and a video showing the effects of each individual subject (violin plots) across time (Figure 2—animation 1).

ii) A number of provisos in the text itself highlighting the small effect sizes where appropriate.

We believe that these additional results make it clear that although the decoding scores obtained within each subject may be small in effect size, they are sufficiently consistent across subjects to support our conclusions.

While the maximum values of the curves in this figure are very different in range (.), the variations in baselines of blue, green, and orange curves do not show the scaling. Please provide figures without the scaling.

We agree with this remark. We have added stimulus-locked and response-locked non-scaled decoding time-courses (Figure 2—figure supplements 1 and 2).

Is the trial uncertainty decoding time course significant? The effect size is very small compared to other features and it is very distributed over the brain which makes us wonder if this feature is actually readable from the brain activity.

Yes, the uncertainty decoding is significant (Average R = 0.12; SEM = 0.024; p < 0.01) (subsection “Neural representations are functionally organized over time and space”).

Note that because this is a continuous variable, we used a ridge regression to decode it, and a Spearman correlation to evaluate the performance of this decoder. This analysis is thus in a different metric scale than the categorical variables, which are decoded with a Logistic Regression and summarized with an AUC.

We have clarified this issue in the text.
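To illustrate why the two scores are not on the same scale, the following is a minimal sketch of the two evaluation metrics only (not of the decoders themselves, and not the authors' code). It uses plain Python, and every function name and all toy values are hypothetical. A Spearman correlation has a chance level of 0 and ranges from -1 to 1, whereas an AUC has a chance level of 0.5 and ranges from 0 to 1.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation of the ranks (chance = 0)."""
    return pearson(ranks(y_true), ranks(y_pred))


def auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation (chance = 0.5)."""
    r = ranks(scores)
    pos = [ri for ri, label in zip(r, labels) if label == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy continuous variable (e.g. uncertainty) and hypothetical decoder outputs.
uncertainty = [0.1, 0.4, 0.2, 0.9, 0.7]
predicted = [0.2, 0.3, 0.1, 0.8, 0.9]
print("Spearman:", spearman(uncertainty, predicted))

# Toy binary variable (e.g. letter = 0, digit = 1) and hypothetical scores.
category = [0, 0, 1, 1, 0, 1]
probability = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
print("AUC:", auc(category, probability))
```

Because the two metrics have different chance levels and ranges, a correlation of 0.12 and an AUC of 0.56 cannot be compared by magnitude alone.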

There are no statistical tests reported in Figure 1G. Please mark significant decoding scores over time by drawing a contour line around the significant clusters. Please describe the statistical tests in Figure 2A and B as the multiple comparison corrections are unclear.

We apologize for this omission. We have now added analysis details regarding the univariate spatio-temporal cluster test, and the multivariate spatial cluster test in the text. In addition, we now supply supplementary figures showing the masks of significant decoding accuracy for both tests (Figure 2—figure supplements 4-9).

For the multivariate spatial cluster test, decoding performance is only displayed for sources that are within a significant cluster (p <.05). We have made this clear in the figure legend.

Overall, these additional analyses confirm and strengthen our original results.

Figure 2A is thresholded based on t-values that exceed an uncorrected p <.1. The reviewers are hoping this is a typo.

This is not a typo, but an explicit choice, which we insufficiently explained in our original manuscript.

For the mass-univariate analyses in source space, the effect of letter/digit decision did not reach statistical significance after correction for multiple comparisons (p = 0.21). The spatio-temporal test was applied across the whole brain (~5,000 sources) and across the entire epoch (0 to 1500 ms), making the procedure very conservative. Because the letter/digit contrast is highly significant when using temporal (p <.001) and spatial (p <.001) decoders, we reasoned it would nonetheless be informative to visualize where the corresponding univariate peak activity would be. We thus adapted the plotting threshold to display the location of the strongest univariate effects. These results point to the Visual Word Form and Number Form Area (Dehaene and Cohen, 2011), as expected.

Overall, these results illustrate that multivariate decoding analyses can be much more sensitive to subtle effects, which would otherwise have been missed with standard mass-univariate analyses had we not known a priori where to look for them. Nonetheless, this sensitivity comes at the price of diminished spatial or temporal specificity.

We have clarified this in the text and the figure legend.

For the curves in Figure 2C, the authors do not indicate the time points when the scores are above chance. For example, we do not know if the blue curve with a max of 0.08 is even significant.

Thank you for the suggestion. We have now added indicators of significance for the temporal cluster test in Figure 2C.

In addition to multiple comparison corrections across time, the authors should correct for multiple comparisons across five features.

As described in the Materials and methods, we use a temporal and/or spatial permutation cluster test which allows us to avoid the issue of corrections for multiple comparisons across time and space.

We have not corrected our results for the five features of interest in the main text. However, we have added feature-corrected results to the supplementary materials, broken down into temporal decoding, spatial decoding and mass univariate analyses. Note that this additional correction does not influence the interpretation of any of our results.

The authors have not reported the thresholds they use for cluster definition and cluster size corrections. Please comment. This is especially important because it is not clear if the authors have also corrected for multiple comparisons across the five features.

We apologize for this lack of precision. For all permutation cluster tests we use the default parameters as provided in the Python module MNE-Python version 0.17.1. Clusters are formed with an initial p <.05 threshold. No cluster size correction is applied. We have added this information to the Materials and methods section of the manuscript.
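For readers unfamiliar with the procedure, the logic of a one-sample temporal cluster permutation test can be sketched from scratch as follows. This is an illustration under stated assumptions, not the MNE-Python implementation the authors used: it is one-sided, uses sign flipping across subjects, takes the cluster-forming t-threshold as an explicit argument, and summarizes each cluster by the sum of its t-values. All function names, the threshold value, and the toy data are hypothetical.

```python
import random
from statistics import mean, stdev


def t_values(data):
    """One-sample t-statistic per time point; data is subjects x time points."""
    n = len(data)
    out = []
    for column in zip(*data):
        sd = stdev(column)
        # A column with zero variance carries no usable t-statistic here.
        out.append(mean(column) / (sd / n ** 0.5) if sd > 0 else 0.0)
    return out


def cluster_masses(tvals, threshold):
    """Find runs of contiguous supra-threshold points; return (start, stop, mass)."""
    clusters, start = [], None
    for i, t in enumerate(tvals + [float("-inf")]):  # sentinel closes the last run
        if t > threshold and start is None:
            start = i
        elif t <= threshold and start is not None:
            clusters.append((start, i, sum(tvals[start:i])))
            start = None
    return clusters


def cluster_permutation_test(data, threshold, n_permutations=200, seed=0):
    """Return (start, stop, p_value) for each observed cluster."""
    rng = random.Random(seed)
    observed = cluster_masses(t_values(data), threshold)
    null_max = []
    for _ in range(n_permutations):
        # Under the null, each subject's effect is symmetric around zero,
        # so the sign of a whole subject's time course can be flipped.
        signs = [rng.choice((-1, 1)) for _ in data]
        flipped = [[s * v for v in row] for s, row in zip(signs, data)]
        masses = [m for _, _, m in cluster_masses(t_values(flipped), threshold)]
        null_max.append(max(masses, default=0.0))
    results = []
    for start, stop, mass in observed:
        p = (1 + sum(m >= mass for m in null_max)) / (n_permutations + 1)
        results.append((start, stop, p))
    return results


# Toy data: 12 subjects x 20 time points of (score - chance), with a
# deterministic subject offset and a true effect at time points 8 to 12.
n_subjects, n_times = 12, 20
data = [[(0.1 if s % 2 == 0 else -0.1) + (1.0 if 8 <= t < 13 else 0.0)
         for t in range(n_times)] for s in range(n_subjects)]

results = cluster_permutation_test(data, threshold=2.2)
for start, stop, p in results:
    print(f"cluster spanning samples {start} to {stop - 1}: p = {p:.3f}")
```

Comparing each observed cluster against the distribution of the maximum cluster mass under permutation is what controls the error rate across all time points at once, so no separate per-sample correction is applied.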

Subsection “Hierarchical recurrence implements a series of all-or-none decisions”: Between 400 and 810 ms, the predictions of 'perceptual category' decoders were better accounted for by sigmoidal (r=0.77 +/-0.03, p<0.001) than by linear trends (r=0.77 +/-0.03, p<0.001)? Please comment.

Thank you for pointing out this mistake. The linear trend was r=0.7 +/- 0.03, p<0.001, which we have now added to the manuscript.

Subsection “Hierarchical recurrence induces an accumulation of delays”: the authors test the correlation of peak latency of averaged temporal decodings when averaged over training times. Please do this analysis with the temporal decoding time courses in Figure 2C, because the main temporal dynamics occur along the diagonal of the TG decoding matrix.

The delay analysis is not based on the diagonal decoding peak latency. We operationalize delays in processing as shifts relative to the diagonal plane. This involves first aligning the test-time axis relative to the diagonal, and then averaging over train time. To aid interpretation we have added a schematic of the method (Figure 5—figure supplement 1).
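One plausible reading of this realignment can be sketched as follows; it is an illustration only, and the authors' schematic remains the authoritative description. The sketch assumes a square temporal generalization matrix indexed as `tg[train][test]`, re-expresses each score by its lag relative to the diagonal (test time minus train time), and averages over training times. The function names and the toy matrix are hypothetical.

```python
def align_to_diagonal(tg):
    """Average a square train x test score matrix by lag = test - train.

    Returns a dict mapping each lag (in samples) to the mean score over
    all training times for which that lag exists.
    """
    n = len(tg)
    sums = {lag: 0.0 for lag in range(-(n - 1), n)}
    counts = dict.fromkeys(sums, 0)
    for train in range(n):
        for test in range(n):
            lag = test - train
            sums[lag] += tg[train][test]
            counts[lag] += 1
    return {lag: sums[lag] / counts[lag] for lag in sums}


def peak_lag(profile):
    """Lag at which the diagonal-aligned score profile is maximal."""
    return max(profile, key=profile.get)


# Toy matrix: decoders generalize best two samples after their training time.
n = 6
tg = [[1.0 if test - train == 2 else 0.5 for test in range(n)]
      for train in range(n)]

profile = align_to_diagonal(tg)
print("peak lag (samples):", peak_lag(profile))
```

A peak at lag 0 would indicate that scores are highest on the diagonal itself, while a positive peak lag indicates a shift relative to the diagonal of the kind the response describes as a processing delay.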

Modeling

Another issue is with the modeling simulations described in the subsection “Statistics”, which disambiguates between the hypotheses in Figure 3. The reviewers' (maybe incorrect) interpretation was that the authors tested whether or not these stimuli are being processed via hierarchical recurrent computations. The reviewers thought this was a strawman argument, as there is no reason to suspect the converse (non-hierarchical/non-recurrent). This modeling work thus only added to their overall feeling that the contributions of the present manuscript were actually quite limited.

All but one model included both hierarchical and recurrent processes. Consequently, the main goal of our modelling was not to test whether hierarchy and recurrence exist, but rather how they can be characterized and what underlying neural architecture their spatio-temporal dynamics imply. To clarify this issue, we added a paragraph to introduce the model comparisons and amended the discussion.

Interpretation

We would suggest adding clear statements in the Abstract and in the Discussion cautioning that these exact results may be limited to this specific task (difficult digit vs. letter classification), and could differ for other tasks (e.g. simple detection task, natural scene or object categorization).

This is a fair remark. While it applies to the vast majority of perceptual decision-making studies, we amended the Abstract and Discussion to highlight the fact that we specifically focus on the restricted case of reading individual characters.

Contributions

There must be a discussion of (Freedman et al., 2002). Those authors parametrically warped dog/cat stimuli to show that a region of the prefrontal cortex (PFC) reflected stimulus discriminability. This is of course closely related to the present work, where the authors use digit/letter stimuli to accomplish the same thing and focus on the visual cortex rather than PFC. The reviewers request some clarification about the contributions of the present study in light of this work and especially discuss the possibility that most of the presented results could be reflecting common input from PFC as suggested by Linsley and MacEvoy, 2014.

We agree with the relevance of the study from Freedman and colleagues (Freedman et al., 2002), in which they recorded PFC neurons in monkeys performing a cats/dogs categorization task. Our study supplements this work in that we:

1) estimate the neural responses of a much larger set of brain areas

2) parametrically manipulate 5 levels of representations

3) formalize the different recurrent architectures that could have accounted for our decoding scores

4) record from human subjects

We have now added a paragraph in the Discussion relating our results to those of Freedman et al., 2002, and made explicit the contribution of the current work. We also mention Linsley and MacEvoy’s paper as one of the papers investigating perceptual decisions using a different set of stimuli.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

The reviewers made a few additional comments.

1) The most important one deals with the lack of significance in your mass-univariate analyses. We suggest you describe the analyses in the main text and explain that they did not reach significance. Because the results make sense, the reviewers suggest keeping them but moving them to the SI, as they should be taken with a grain of salt (and stating that).

While we appreciate this concern given the weakness of some of the effects, we believe that it is important to keep the univariate figure for a number of reasons.

First, all but one of the analyses are significant. Removing this figure because one of the effects does not reach statistical significance is not legitimate.

Second, we explicitly warn the reader both in the figure legend and in the main text that one of the univariate effects (letter/digit category) does not reach statistical significance after correction for multiple comparisons:

“Modelling neural activity as a function of our orthogonal stimulus properties yielded non-significant results for the decision variable of interest. […] The direct comparison between the sensitivity of the mass-univariate versus multivariate approaches are shown in Figure 2.”

Third, keeping the univariate analyses allows us to highlight the robustness and complementarity of the multivariate analyses. Specifically, MVPA over time clearly shows that letter/digit category can be decoded from ~150 ms after word onset, and MVPA over space clearly shows that this representation peaks around the visual word form area. These significant effects are compatible with the trend that we observed with univariate analyses.

Fourth, the non-significant univariate trends actually correspond to the region of interest that we expected: i.e. the visual word form area (Dehaene and Cohen, 2011).

Together, these elements make us believe that the univariate figures should be present in the main text. We have, however, added the non-thresholded maps of p-values to the supplementary materials for completeness.

2) Related to comment 2 on statistics, the plots in Figure 2—figure supplements 1 and 2 are again scaled, because the y-axis of each plot is scaled to that plot's maximum.

We have now added a non-scaled version of Figure 2C in the supplementary materials, time locked to stimulus onset and response onset.

https://doi.org/10.7554/eLife.56603.sa2

Article and author information

Author details

  1. Laura Gwilliams

    1. Department of Psychology, New York University, New York, United States
    2. NYU Abu Dhabi Institute, Abu Dhabi, United Arab Emirates
    Contribution
    Conceptualization, Formal analysis, Investigation, Visualization, Writing - original draft, Writing - review and editing
    For correspondence
    leg5@nyu.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-9213-588X
  2. Jean-Remi King

    1. Department of Psychology, New York University, New York, United States
    2. Frankfurt Institute for Advanced Studies, Frankfurt, Germany
    3. Laboratoire des Systèmes Perceptifs (CNRS UMR 8248), Département d’Études Cognitives, École Normale Supérieure, PSL University, Paris, France
    Contribution
    Conceptualization, Formal analysis, Supervision, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-2121-170X

Funding

William Orr Dingwall Foundation (Dissertation Fellowship)

  • Laura Gwilliams

Abu Dhabi Institute Grant (G1001)

  • Laura Gwilliams

Horizon 2020 Framework Programme (660086)

  • Jean-Remi King

Fondation Bettencourt Schueller (Bettencourt-Schueller Foundation)

  • Jean-Remi King

Fondation Roger de Spoelberch (Fondation Roger de Spoelberch)

  • Jean-Remi King

Philippe Foundation (Philippe Foundation)

  • Jean-Remi King

National Institutes of Health (R01DC05660)

  • Laura Gwilliams

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This project received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 660086, the Bettencourt-Schueller Foundation, the Fyssen Foundation, the Philippe Foundation, the Abu Dhabi Institute G1001, NIH R01DC05660 and the Dingwall Foundation. We are infinitely grateful to Stanislas Dehaene, as well as David Poeppel and Alec Marantz, for their support. We thank Michael Landy for his very helpful and generous feedback on a previous version of the manuscript.

Ethics

Human subjects: This study was ethically approved by the comité de protection des personnes (CPP) IDF 7 under the reference CPP 08 021. All subjects gave written informed consent to participate in this study, which was approved by the local Ethics Committee, in accordance with the Declaration of Helsinki. Participants were compensated for their participation.

Senior Editor

  1. Michael J Frank, Brown University, United States

Reviewing Editor

  1. Thomas Serre, Brown University, United States

Publication history

  1. Received: March 3, 2020
  2. Accepted: August 30, 2020
  3. Accepted Manuscript published: September 1, 2020 (version 1)
  4. Version of Record published: September 18, 2020 (version 2)

Copyright

© 2020, Gwilliams and King

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
