
Shared and modality-specific brain regions that mediate auditory and visual word comprehension

  1. Anne Keitel (corresponding author)
  2. Joachim Gross
  3. Christoph Kayser
  1. Psychology, University of Dundee, United Kingdom
  2. Institute of Neuroscience and Psychology, University of Glasgow, United Kingdom
  3. Institute for Biomagnetism and Biosignalanalysis, University of Münster, Germany
  4. Department for Cognitive Neuroscience, Faculty of Biology, Bielefeld University, Germany
Research Article
Cite this article as: eLife 2020;9:e56972 doi: 10.7554/eLife.56972

Abstract

Visual speech carried by lip movements is an integral part of communication. Yet, it remains unclear to what extent visual and acoustic speech comprehension are mediated by the same brain regions. Using multivariate classification of full-brain MEG data, we first probed where the brain represents acoustically and visually conveyed word identities. We then tested where these sensory-driven representations are predictive of participants’ trial-wise comprehension. The comprehension-relevant representations of auditory and visual speech converged only in anterior angular and inferior frontal regions and were spatially dissociated from those representations that best reflected the sensory-driven word identity. These results provide a neural explanation for the behavioural dissociation of acoustic and visual speech comprehension and suggest that cerebral representations encoding word identities may be more modality-specific than often upheld.

Introduction

Acoustic and visual speech signals are both elemental for everyday communication. While acoustic speech consists of temporal and spectral modulations of sound pressure, visual speech consists of movements of the mouth, head, and hands. Movements of the mouth, lips and tongue in particular provide both redundant and complementary information to acoustic cues (Hall et al., 2005; Peelle and Sommers, 2015; Plass et al., 2019; Summerfield, 1992), and can help to enhance speech intelligibility in noisy environments and in a second language (Navarra and Soto-Faraco, 2007; Sumby and Pollack, 1954; Yi et al., 2013). While a plethora of studies have investigated the cerebral mechanisms underlying speech in general, we still have a limited understanding of the networks specifically mediating visual speech perception, that is, lip reading (Bernstein and Liebenthal, 2014; Capek et al., 2008; Crosse et al., 2015). In particular, it remains unclear whether visual speech signals are largely represented in dedicated regions, or whether these signals are encoded by the same networks that mediate auditory speech perception.

Behaviourally, our ability to understand acoustic speech seems to be independent from our ability to understand visual speech. In the typical adult population, performance in auditory/verbal and visual speech comprehension tasks is uncorrelated (Conrad, 1977; Jeffers and Barley, 1980; Mohammed et al., 2006; Summerfield, 1991; Summerfield, 1992). Moreover, large inter-individual differences in lip reading skills contrast with the low variability seen in auditory speech tests (Summerfield, 1992). In contrast to this behavioural dissociation, neuroimaging and neuroanatomical studies have suggested the convergence of acoustic and visual speech information in specific brain regions (Calvert, 1997; Campbell, 2008; Ralph et al., 2017; Simanova et al., 2014). Prevalent models postulate a fronto-temporal network mediating acoustic speech representations, comprising a word-meaning pathway from auditory cortex to inferior frontal areas, and an articulatory pathway that extends from auditory to motor regions (Giordano et al., 2017; Giraud and Poeppel, 2012; Gross et al., 2013; Hickok, 2012; Huth et al., 2016; Morillon et al., 2019). Specifically, a number of anterior-temporal and frontal regions have been implicated in implementing amodal semantic representations (MacSweeney et al., 2008; Ralph et al., 2017; Simanova et al., 2014) and in enhancing speech perception in adverse environments, based on the combination of acoustic and visual signals (Giordano et al., 2017).

Yet, when it comes to representing visual speech signals themselves, our understanding becomes much less clear. That is, we know relatively little about which brain regions mediate lip reading. Previous studies have shown that visual speech activates ventral and dorsal visual pathways and bilateral fronto-temporal circuits (Bernstein and Liebenthal, 2014; Calvert, 1997; Campbell, 2008; Capek et al., 2008). Some studies have explicitly suggested that auditory regions are also involved in lip reading (Calvert, 1997; Calvert and Campbell, 2003; Capek et al., 2008; Lee and Noppeney, 2011; Pekkola et al., 2005), for example by receiving signals from visual cortices that can be exploited to establish coarse-grained acoustic representations (Bourguignon et al., 2020). While these findings can be seen to suggest that largely the same brain regions represent acoustic and visual speech, neuroimaging studies have left the nature and the functional specificity of these visual speech representations unclear (Bernstein and Liebenthal, 2014; Crosse et al., 2015; Ozker et al., 2018). This is in part because most studies focused on mapping activations rather than specific semantic or lexical speech content. Indeed, alternative accounts have been proposed, which hold that visual and auditory speech representations are largely distinct (Bernstein and Liebenthal, 2014; Evans et al., 2019).

When investigating how speech is encoded in the brain, it is important to distinguish purely stimulus-driven neural activity (e.g. classic ‘activation’) from activity specifically representing a stimulus and contributing to the participant’s percept on an individual trial (Bouton et al., 2018; Grootswagers et al., 2018; Keitel et al., 2018; Panzeri et al., 2017; Tsunada et al., 2016). That is, it is important to differentiate the representations of sensory inputs per se from those representations of sensory information that directly contribute to, or at least correlate with, the single-trial behavioural outcome. Recent neuroimaging studies have suggested that the cerebral representations of the physical speech signal are partly distinct from those reflecting the actually perceived meaning. For example, syllable identity can be decoded from temporal, occipital and frontal areas, but only focal activity in the inferior frontal gyrus (IFG) and posterior superior temporal gyrus (pSTG) mediates perceptual categorisation (Bouton et al., 2018). Similarly, the encoding of the acoustic speech envelope is widespread across the brain, but correct word comprehension correlates only with focal activity in temporal and motor regions (Scott, 2019; Keitel et al., 2018). In general, activity in lower sensory pathways seems to correlate more with the actual physical stimulus, while activity in specific higher-tier regions correlates with the subjective percept (Crochet et al., 2019; Romo et al., 2012). However, this differentiation poses a challenge for data analysis, and studies on sensory perception are only beginning to address this systematically (Grootswagers et al., 2018; Panzeri et al., 2017; Ritchie et al., 2015).

We here capitalise on this functional differentiation of cerebral speech representations that simply reflect the physical stimulus, from those representations of the sensory inputs that correlate with the perceptual outcome, to identify the comprehension-relevant encoding of auditory and visual word identity in the human brain. That is, we ask where and to what degree comprehension-relevant representations of auditory and visual speech overlap. To this end, we exploited a paradigm in which participants performed a comprehension task based on individual sentences that were presented either acoustically or visually (lip reading), while brain activity was recorded using MEG (Keitel et al., 2018). We then extracted single-trial word representations and applied multivariate classification analysis geared to quantify (i) where brain activity correctly encodes the actual word identity regardless of behavioural outcome, and (ii) where the quality of the cerebral representation of word identity (or its experimentally obtained readout) is predictive of the participant’s comprehension. Note that the term ‘word identity’ in the present study refers to the semantic, as well as the phonological form of a word (see Figure 1—figure supplement 1 for an exploratory semantic and phonological analysis).

Results

Behavioural performance

On each trial, the 20 participants viewed or listened to visually or acoustically presented sentences (presented in blocks), and performed a comprehension task on a specific target word (4-alternative forced-choice identification of word identity). The 18 target words, which always occurred in the third or second last position of the sentence, each appeared in 10 different auditory and visual sentences to facilitate the use of classification-based data analysis (see table in Supplementary file 1 for all used target words). Acoustic sentences were presented mixed with background noise, to equalise performance between visual and auditory trials. On average, participants perceived the correct target word in approximately 70% of trials across auditory and visual conditions (chance level was 25%). The behavioural performance did not differ significantly between these conditions (Mauditory = 69.7%, SD = 7.1%, Mvisual = 71.7%, SD = 20.0%; t(19) = −0.42, p=0.68; Figure 1), demonstrating that the addition of acoustic background noise indeed equalised performance between conditions. Still, the between-subject variability in performance was larger in the visual condition (between 31.7% and 98.3%), in line with the notion that lip reading abilities vary considerably across individuals (Bernstein and Liebenthal, 2014; Summerfield, 1992; Tye-Murray et al., 2014). Due to the near ceiling performance (above 95% correct), the data from three participants in the visual condition had to be excluded from the neuro-behavioural analysis. Participants also performed the task with auditory and visual stimuli presented at the same time (audiovisual condition), but because performance in this condition was near ceiling (Maudiovisual = 96.4%, SD = 3.3%), we present the corresponding data only in the supplementary material (Figure 2—figure supplement 2A).

Figure 1 (with 2 supplements)
Trial structure and behavioural performance.

(A) Trial structure was identical in the auditory and visual conditions. Participants listened to stereotypical sentences while a fixation dot was presented (auditory condition) or watched videos of a speaker saying sentences (visual condition). The face of the speaker is obscured for visualisation here only. After each trial, a prompt on the screen asked which adjective (or number) appeared in the sentence and participants chose one of four alternatives by pressing a corresponding button. Target words (here ‘beautiful’) occupied the 2nd or 3rd last position in the sentence. (B) Participants’ behavioural performance in auditory (blue) and visual (orange) conditions, and their individual SNR values (grey) used for the auditory condition. Dots represent individual participants (n = 20), boxes denote median and interquartile ranges, whiskers denote minima and maxima (no outliers present). MEG data of two participants (shaded in a lighter colour) were not included in neural analyses due to excessive artefacts. Participants exceeding a performance of 95% correct (grey line) were excluded from the neuro-behavioural analysis (which was the case for three participants in the visual condition). (C) Example sentence with target adjective marked in blue.

An exploratory representational similarity analysis (RSA) (Evans and Davis, 2015; Kriegeskorte et al., 2008) indicated that participants’ behavioural responses were influenced by both semantic and phonological features in both conditions (see Materials and methods, and Figure 1—figure supplement 1). A repeated-measures ANOVA yielded a main effect of condition (F(1,19) = 7.53, p=0.013; mean correlations: Mauditory = 0.38, SEM = 0.01; Mvisual = 0.43, SEM = 0.02) and a main effect of features (F(1,19) = 20.98, p<0.001, Mphon = 0.43, SEM = 0.01; Msem = 0.37, SEM = 0.01). A post-hoc comparison revealed that in both conditions phonological features influenced behaviour more strongly than semantic features (Wilcoxon signed-rank test; Zauditory = 151, p=0.037, Zvisual = 189, p=0.002). While the small number of distinct word identities used here (n = 9 in two categories) precludes a clear link between these features and the underlying brain activity, these results suggest that participants’ performance was also driven by non-semantic information.

Decoding word identity from MEG source activity

Using multivariate classification, we quantified how well the single-trial identity of the target words (18 target words, each repeated 10 times) could be correctly predicted from source-localised brain activity (‘stimulus classifier’). Classification was computed in source space at the single-subject level in a 500 ms window aligned to the onset of the respective target word. Importantly, for each trial we computed classification performance within the subset of the four presented alternative words in each trial, on which participants performed their behavioural judgement. We did this to be able to directly link neural representations of word identity with perception in a later analysis. We first quantified how well brain activity encoded the word identity regardless of behaviour (‘stimulus-classification’; c.f. Materials and methods). The group-level analysis (n = 18 participants with usable MEG, cluster-based permutation statistics, corrected at p=0.001 FWE) revealed significant stimulus classification performance in both conditions within a widespread network of temporal, occipital and frontal regions (Figure 2).
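The restriction of classification to the four response alternatives can be illustrated with a minimal sketch. It assumes a classifier that yields one evidence score per candidate word; the scoring values, the word list (apart from the example target ‘beautiful’) and all function names are hypothetical and are not taken from the study. Under this assumption, the decoded word is the highest-scoring candidate among the four alternatives offered on that trial, which is why chance level is 25% rather than 1/18.

```python
def decode_among_alternatives(evidence, alternatives):
    """Return the offered alternative with the highest classifier evidence.

    evidence: dict mapping candidate words to a score (higher = more evidence).
    alternatives: the four words offered to the participant on this trial.
    """
    return max(alternatives, key=lambda word: evidence[word])


def classification_accuracy(trials):
    """Fraction of trials on which the decoded word equals the true target."""
    hits = sum(
        decode_among_alternatives(t["evidence"], t["alternatives"]) == t["target"]
        for t in trials
    )
    return hits / len(trials)


# Hypothetical toy trials; all scores are invented for illustration only.
trials = [
    {
        "target": "beautiful",
        "alternatives": ["beautiful", "clever", "famous", "green"],
        # 'seven' has the highest score overall but was not offered on this
        # trial, so it cannot be selected.
        "evidence": {"beautiful": 0.41, "clever": 0.12, "famous": 0.30,
                     "green": 0.05, "seven": 0.90},
    },
    {
        "target": "clever",
        "alternatives": ["beautiful", "clever", "famous", "green"],
        "evidence": {"beautiful": 0.35, "clever": 0.20, "famous": 0.10,
                     "green": 0.15},
    },
]

accuracy = classification_accuracy(trials)              # one of two trials correct
chance_level = 1 / len(trials[0]["alternatives"])       # four alternatives
```

Scoring within the offered alternatives puts the neural readout on the same footing as the participant's four-alternative choice, which is what allows the two to be related trial by trial.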

Figure 2 (with 3 supplements)
Word classification based on MEG activity regardless of behavioural performance (‘stimulus classification’).

Surface projections show areas with significant classification performance at the group level (n = 18; cluster-based permutation statistics, corrected at p<0.001 FWE). Results show strongest classification in temporal regions for the auditory condition (A) and occipital areas for the visual condition (B). Cluster peaks are marked with dots. Panel (C) overlays the significant effects from both conditions, with the overlap shown in green. The overlap contains bilateral regions in middle and inferior temporal gyrus, the inferior frontal cortex and dorsal regions of the postcentral and supramarginal gyrus (SMG). The peak of the overlap is in the postcentral gyrus. (D) Grid point-wise Bayes factors for a difference between auditory and visual word classification performance for all grid points in the ROIs characterised by a significant contribution to word classification in at least one modality in panel A or B (red: evidence for a difference between conditions, that is in favour of H1 [alternative hypothesis]; blue: evidence for no difference between conditions, that is in favour of H0 [null hypothesis]). RO – Rolandic operculum; POST – postcentral gyrus; IFG – inferior frontal gyrus; OCC – occipital gyrus.

Auditory speech was represented bilaterally in fronto-temporal areas, extending into intra-parietal regions within the left hemisphere (Figure 2A; Table 1). Cluster-based permutation statistics yielded two large clusters: a left-lateralised cluster peaking in inferior postcentral gyrus (left POST; Tsum = 230.42, p<0.001), and a right-lateralised cluster peaking in the Rolandic operculum (right RO; Tsum = 111.17, p<0.001). Visual speech was represented bilaterally in occipital areas, as well as in left parietal and frontal areas (Figure 2B), with classification performance between 25.9% and 33.9%. There were three clusters: a large bilateral posterior cluster that peaked in the left calcarine gyrus (left OCC; Tsum = 321.78, p<0.001), a left-hemispheric cluster that peaked in the inferior frontal gyrus (left IFG; Tsum = 10.98, p<0.001), and a left-hemispheric cluster that peaked in the postcentral gyrus (left POST; Tsum = 35.83, p<0.001). The regions representing word identity in both visual and auditory conditions overlapped in the middle and inferior temporal gyrus, the postcentral and supramarginal gyri, and the left inferior frontal gyrus (Figure 2C; overlap in green). MNI coordinates of cluster peaks and the corresponding classification values are given in Table 1. Results for the audiovisual condition essentially mirror the unimodal findings and exhibit significant stimulus classification in bilateral temporal and occipital regions (Figure 2—figure supplement 2B).

Table 1
Peak effects of stimulus classification performance based on MEG activity.

Labels are taken from the AAL atlas (Tzourio-Mazoyer et al., 2002). For each peak, MNI coordinates, and classification performance (mean and SEM) are presented. Chance level for classification was 25%. Abbreviations as used in Figure 2 are given in parentheses.

| Atlas label | MNI X | MNI Y | MNI Z | Classification % (SEM) |
|---|---|---|---|---|
| Auditory peaks | | | | |
| Rolandic Oper R (RO) | 41 | −14 | 20 | 28.89 (0.78) |
| Postcentral L (POST) | −48 | −21 | 25 | 29.04 (1.00) |
| Visual peaks | | | | |
| Calcarine L (OCC) | −5 | −101 | −7 | 33.92 (1.53) |
| Frontal Inf Tri L (IFG) | −48 | 23 | 1 | 26.70 (0.83) |
| Postcentral L (POST) | −51 | −24 | 47 | 26.85 (1.02) |
| Peak of overlap | | | | |
| Postcentral L (POST) | −47 | −15 | 52 | 26.50 (0.67) |

To directly investigate whether regions differed in their classification performance between visual and auditory conditions, we performed two analyses. First, we investigated the evidence for or against the null hypothesis of no condition difference for all grid points contributing to word classification in at least one modality (i.e. the combined clusters derived from Figure 2A,B). The respective Bayes factors for each ROI (from a group-level t-test) are shown in Figure 2D. These revealed no conclusive evidence for many grid points within these clusters (1∕3 < bf10<3). However, both auditory clusters and the occipital visual cluster contained grid points with substantial or strong (bf10 > 3 and bf10 > 10, respectively) evidence for a significant modality difference. In contrast, the visual postcentral region (POST), the IFG and the overlap region contained many grid points with substantial evidence for no difference between modalities (1∕10 < bf10<1∕3). Second, we performed a full-brain cluster-based permutation test for a modality difference. The respective results, masked by the requirement of significant word classification in at least one modality, are shown in Figure 2—figure supplement 1A. Auditory classification was significantly better in clusters covering left and right auditory cortices, while visual classification was significantly better in bilateral visual sensory areas. Full-brain Bayes factors confirm that, apart from sensory areas exhibiting strong evidence for a modality difference, many grid points show substantial evidence for no modality difference, or inconclusive evidence (Figure 2—figure supplement 1B).
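The Bayes factor categories used in this paragraph can be summarised as a small lookup. This is a minimal sketch: the thresholds 1/10, 1/3, 3 and 10 are those quoted in the text, whereas the label for bf10 below 1/10 (‘strong evidence for H0’, by symmetry with bf10 > 10), the handling of values that fall exactly on a threshold, and the function name are assumptions made for illustration.

```python
def interpret_bf10(bf10):
    """Map a Bayes factor (H1 relative to H0) onto a verbal evidence category."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bf10 > 10:
        return "strong evidence for H1"
    if bf10 > 3:
        return "substantial evidence for H1"
    if bf10 >= 1 / 3:
        return "inconclusive"
    if bf10 >= 1 / 10:
        return "substantial evidence for H0"
    # Assumed by symmetry; this category is not named explicitly in the text.
    return "strong evidence for H0"


# Example values chosen for illustration only.
labels = {value: interpret_bf10(value) for value in (12.0, 5.0, 1.0, 0.27, 0.05)}
```

Under this mapping, a grid point with bf10 = 0.27 would count as substantial evidence for no modality difference, whereas bf10 = 1 would be inconclusive.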

Given that most clusters were found in only one hemisphere, we performed a direct test on whether these effects are indeed lateralised in a statistical sense (c.f. Materials and methods). We found evidence for a statistically significant lateralisation for both auditory clusters (left cluster peaking in POST: t(17) = 5.15, pFDR <0.001; right cluster peaking in RO: t(17) = 4.26, pFDR <0.01). In the visual condition, the lateralisation test for the two left clusters reached only marginal significance (left cluster peaking in IFG: t(17) = 2.19, pFDR = 0.058; left cluster peaking in POST: t(17) = 1.87, pFDR = 0.078). Note that the large occipital cluster in the visual condition is bilateral and we therefore did not test this for a lateralisation effect. Collectively, these analyses provide evidence that distinct frontal, occipital and temporal regions represent word identity specifically for visual and acoustic speech, while also providing evidence that regions within inferior temporal and frontal cortex, the SMG and dorsal post-central cortex reflect word identities in both modalities.

Cerebral speech representations that are predictive of comprehension

The above analysis leaves it unclear which of these cerebral representations of word identity are actually relevant for single-trial word comprehension. That is, it remains unclear, which cerebral activations reflect the word identity in a manner that directly contributes to, or at least correlates with, participants' performance on the task. To directly address this, we computed an index of how strongly the evidence for a specific word identity in the neural single-trial word representations is predictive of the participant’s response. We regressed the evidence in the cerebral classifier for word identity against the participants’ behaviour (see Materials and methods). The resulting neuro-behavioural weights (regression betas) were converted into t-values for group-level analysis. The results in Figure 3 (two-sided cluster-based permutation statistics, corrected at p=0.05 FWE) reveal several distinct regions in which neural representations of word identity are predictive of behaviour. In the auditory condition, we found five distinct clusters. Three were in the left hemisphere, peaking in the left inferior temporal gyrus (left ITG; Tsum = 469.55, p<0.001), the inferior frontal gyrus (left IFG; Tsum = 138.70, p<0.001), and the middle occipital gyrus (left MOG; Tsum = 58.44, p<0.001). In the right hemisphere, the two significant clusters were in the supplementary motor area (right SMA; Tsum = 312.48, p<0.001) and in the angular gyrus (right AG; Tsum = 68.59, p<0.001; Figure 3A). In the visual condition, we found four clusters: a left-hemispheric cluster in the inferior frontal gyrus (left IFG; Tsum = 144.15, p<0.001) and three clusters with right-hemispheric peaks, in the superior temporal gyrus (right STG; Tsum = 168.68, p<0.001), the superior frontal gyrus (right SFG; Tsum = 158.39, p<0.001) and the angular gyrus (right AG; Tsum = 37.42, p<0.001; Figure 3B). MNI coordinates of cluster peaks and the corresponding beta and t-values are given in Table 2. Interestingly, these perception-relevant (i.e. predictive) auditory and visual representations did not overlap (Figure 3C), although some of them occurred in adjacent regions in the IFG and AG.

Figure 3 (with 2 supplements)
Cortical areas in which neural word representations predict participants’ response.

Coloured areas denote significant group-level effects (surface projection of the cluster-based permutation statistics, corrected at p<0.05 FWE). (A) In the auditory condition (n = 18), we found five clusters (cluster peaks are marked with dots). Three were in left ventral regions, in the inferior frontal gyrus, the inferior temporal gyrus, and the occipital gyrus, the other two were in the right hemisphere, in the angular gyrus and the supplementary motor area. (B) In the visual condition (n = 15; three participants were excluded due to near ceiling performance), we found four clusters: In the left (dorsal) inferior frontal gyrus, the right anterior cingulum stretching to left dorsal frontal regions, in the right angular gyrus and the right superior temporal gyrus (all peaks are marked with dots). Panel (C) overlays the significant effects from both conditions. There was no overlap. However, both auditory and visual effects were found in adjacent regions within the left IFG and the right AG. Panel (D) shows distributions of grid point-wise Bayes factors for a difference between auditory and visual conditions for these clusters (red: evidence for differences between conditions, that is in favour of H1 [alternative hypothesis]; blue: evidence for no difference between conditions, that is in favour of H0 [null hypothesis]).

Table 2
Peak effects for the neuro-behavioural analysis.

Labels are taken from the AAL atlas (Tzourio-Mazoyer et al., 2002). For each local peak, MNI coordinates, regression beta (mean and SEM across participants) and corresponding t-value are presented. Abbreviations as used in Figure 3 are given in parentheses.

| Atlas label | MNI X | MNI Y | MNI Z | Beta (SEM) | t-value |
|---|---|---|---|---|---|
| Auditory | | | | | |
| Temporal Inf L (ITG) | −41 | −23 | −26 | 0.106 (0.024) | 4.40 |
| Frontal Inf Orb L (IFG) | −28 | 25 | −9 | 0.082 (0.031) | 2.66 |
| Occipital Mid L, Occipital Inf L (MOG) | −46 | −83 | −4 | 0.079 (0.029) | 2.75 |
| Supp Motor Area R (SMA) | 3 | 11 | 52 | 0.089 (0.027) | 3.33 |
| Angular R (AG) | 49 | −67 | 40 | 0.079 (0.027) | 2.87 |
| Visual | | | | | |
| Frontal Inf Tri L (IFG) | −57 | 30 | 4 | 0.075 (0.017) | 4.34 |
| Frontal Sup Medial R, Cingulum Ant R (SFG) | 9 | 47 | 15 | 0.080 (0.028) | 2.86 |
| Temporal Sup R (STG) | 38 | −30 | 10 | 0.086 (0.023) | 3.77 |
| Angular R (AG) | 60 | −55 | 34 | 0.073 (0.020) | 3.55 |

ITG – inferior temporal gyrus; IFG – inferior frontal gyrus; MOG – middle occipital gyrus; SMA – supplementary motor area; AG – angular gyrus; SFG – superior frontal gyrus; STG – superior temporal gyrus.
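The conversion of regression betas into group-level t-values can be sketched as follows. This assumes that the conversion is a one-sample t-test of the participant-wise betas against zero, so that t equals the mean beta divided by its standard error; the text does not spell this out, and the function name and toy values are hypothetical. The second part checks that the values reported in Table 2 are compatible with this reading, bearing in mind that the published betas and SEMs are rounded.

```python
import math


def one_sample_t(values):
    """Return (t, mean, sem) for a one-sample t-test against zero."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    sem = math.sqrt(variance / n)
    return mean / sem, mean, sem


# Hypothetical participant-wise betas, for illustration only.
toy_betas = [0.1, 0.2, 0.3]
t_toy, mean_toy, sem_toy = one_sample_t(toy_betas)

# (label, mean beta, SEM, reported t-value) as listed in Table 2.
table2 = [
    ("ITG, auditory", 0.106, 0.024, 4.40),
    ("IFG, auditory", 0.082, 0.031, 2.66),
    ("MOG, auditory", 0.079, 0.029, 2.75),
    ("SMA, auditory", 0.089, 0.027, 3.33),
    ("AG, auditory", 0.079, 0.027, 2.87),
    ("IFG, visual", 0.075, 0.017, 4.34),
    ("SFG, visual", 0.080, 0.028, 2.86),
    ("STG, visual", 0.086, 0.023, 3.77),
    ("AG, visual", 0.073, 0.020, 3.55),
]

# Largest gap between beta / SEM and the reported t-value across all peaks.
max_deviation = max(abs(beta / sem - t) for _, beta, sem, t in table2)
```

For every peak, the ratio of beta to SEM lies close to the reported t-value, with small deviations that are plausibly due to rounding of the tabulated values.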

Again, we asked whether the behavioural relevance of these regions exhibits a significant bias towards either modality by investigating the between-condition contrast for all clusters that are significantly predictive of behaviour (Bayes factors derived from the group-level t-test; Figure 3D). Three auditory clusters contained grid points that differed substantially or strongly (bf10 > 3 and bf10 > 10, respectively) between modalities (left ITG, left MOG, and right AG). In addition, in two regions the majority of grid points provided substantial evidence for no difference between modalities (IFG and AG from the visual condition). A separate full-brain cluster-based permutation test (Figure 3—figure supplement 1A) provided evidence for a significant modality specialisation for auditory words in four clusters in the left middle occipital gyrus, left calcarine gyrus, right posterior angular gyrus, and bilateral supplementary motor area. The corresponding full-brain Bayes factors (Figure 3—figure supplement 1B) support this picture but also provide no evidence for a modality preference, or inconclusive results, in many other regions. Importantly, those grid points containing evidence for a significant modality difference in the full brain analysis (Figure 3—figure supplement 1B) correspond to those auditory ROIs derived in Figure 3A (MOG, posterior AG and SMA). On the other hand, regions predictive of visual word comprehension did not show a significant modality preference. Regarding the lateralisation of these clusters, we found that corresponding betas in the contralateral hemisphere were systematically smaller in all clusters but did not differ significantly (all pFDR ≥0.15), hence providing no evidence for a strict lateralisation.

To further investigate whether perception-relevant auditory and visual representations are largely distinct, we performed a cross-decoding analysis, in which we directly quantified whether the activity patterns of local speech representations are the same across modalities. At the whole-brain level, we found no evidence for significant cross-classification (two-sided cluster-based permutation statistics, neither at a corrected p=0.001 nor a more lenient p=0.05; Figure 2—figure supplement 3A, left panel). That significant cross-classification is, in principle, possible with these data is shown by the significant results for the audiovisual condition (Figure 2—figure supplement 3A, right panel). An analysis of the Bayes factors for this cross-classification test confirmed that most grid points contained substantial evidence for no cross-classification between the auditory and visual conditions (Figure 2—figure supplement 3B, left panel). On the other hand, there was strong evidence for significant cross-classification between the uni- and multisensory conditions in temporal and occipital regions (Figure 2—figure supplement 3B, right panel).
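The logic of cross-decoding can be illustrated with a toy example: a classifier is trained on patterns from one modality and tested on patterns from the other. The sketch below uses a nearest-centroid classifier as a simple stand-in, which is not claimed to be the classifier used in the study; the two word labels, the four-feature activity patterns and all function names are invented for illustration. In this toy case the two modalities encode word identity in different features, so decoding works within each modality but drops to chance across modalities.

```python
def centroid(patterns):
    """Feature-wise mean of a list of equally long activity patterns."""
    return [sum(column) / len(patterns) for column in zip(*patterns)]


def train(data):
    """Fit a nearest-centroid model: one mean pattern per word."""
    return {word: centroid(patterns) for word, patterns in data.items()}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(model, pattern):
    """Return the word whose centroid is closest to the pattern."""
    return min(model, key=lambda word: squared_distance(model[word], pattern))


def accuracy(model, data):
    pairs = [(word, p) for word, patterns in data.items() for p in patterns]
    return sum(predict(model, p) == word for word, p in pairs) / len(pairs)


# Invented patterns: auditory trials separate the words in features 0 and 1,
# visual trials separate them in features 2 and 3.
auditory = {
    "word_a": [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]],
    "word_b": [[0.0, 1.0, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]],
}
visual = {
    "word_a": [[0.6, 0.4, 1.0, 0.0], [0.4, 0.6, 1.0, 0.0]],
    "word_b": [[0.6, 0.4, 0.0, 1.0], [0.4, 0.6, 0.0, 1.0]],
}

# In-sample accuracies, for brevity; a real analysis would use cross-validation.
within_auditory = accuracy(train(auditory), auditory)
within_visual = accuracy(train(visual), visual)
# Cross-decoding: train on auditory trials, test on visual trials.
cross_modal = accuracy(train(auditory), visual)
```

Here both within-modality accuracies are perfect, while the cross-modal accuracy equals the two-word chance level of 0.5, even though both modalities are decodable from the same set of features. This is the toy analogue of representations that overlap spatially without sharing the same activity pattern.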

Strong sensory representations do not necessarily predict behaviour

The above results suggest that the brain regions in which sensory representations shape speech comprehension are mostly distinct from those allowing the best prediction of the actual stimulus (see Figure 4 for direct visualisation of the results from both analyses). Only parts of the left inferior temporal gyrus (auditory condition), the right superior temporal gyrus (visual condition) and the left inferior frontal gyrus (both conditions) feature high stimulus classification and a driving role for comprehension. In other words, the accuracy by which local activity reflects the physical stimulus is generally not predictive of the impact of this local word representation on behaviour. To test this formally, we performed within-participant robust regression analyses between the overall stimulus classification performance and the perceptual weight of each local representation across all grid points. Group-level statistics of the participant-specific beta values provided no support for a consistent relationship between these (auditory condition: b = 0.05 ± 0.11 [M ± SEM], t(17) = 0.50, pFDR = 0.81; visual condition: b = 0.03 ± 0.11 [M ± SEM], t(14) = 0.25, pFDR = 0.81; participant-specific regression slopes are depicted in Figure 4B). A Bayes factor analysis also provided substantial evidence for no consistent relationship (bf10 = 0.27 and bf10 = 0.27, for auditory and visual conditions, respectively).
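Why a robust rather than an ordinary least-squares slope is useful for such within-participant regressions can be shown with a small sketch. The robust estimator used for the analysis is not named in this section, so the example swaps in the Theil–Sen estimator (the median of all pairwise slopes) purely for illustration; the data points and function names are hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(x, y):
    """Robust slope estimate: median of the slopes between all point pairs."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
        if x2 != x1
    ]
    return median(slopes)


def ols_slope(x, y):
    """Ordinary least-squares slope, shown for comparison."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = sum((a - mean_x) ** 2 for a in x)
    return numerator / denominator


# Invented data: y = 2x + 1 for the first six points, plus one gross outlier.
x = [0, 1, 2, 3, 4, 5, 6]
y = [1, 3, 5, 7, 9, 11, 100]

robust = theil_sen_slope(x, y)   # stays at the slope of the bulk of the data
ordinary = ols_slope(x, y)       # pulled strongly upwards by the outlier
```

In an analysis of this kind, one such slope would be estimated per participant, relating stimulus classification performance to the neuro-behavioural weight across grid points, and the resulting slopes would then be tested against zero at the group level.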

Figure 4
Largely distinct regions provide strong stimulus classification and mediate behavioural relevance.

(A) Areas with significant stimulus classification (from Figure 2) are shown in yellow, those with significant neuro-behavioural results (from Figure 3) in green, and the overlap in blue. The overlap in the auditory condition (N = 14 grid points) comprised parts of the left inferior and middle temporal gyrus (ITG), and the orbital part of the left inferior frontal gyrus (IFG). The overlap in the visual condition (N = 27 grid points) comprised the triangular part of the inferior frontal gyrus (IFG), and parts of the superior temporal gyrus (STG), extending dorsally to the Rolandic operculum. (B) Results of a regression between word classification performance and neuro-behavioural weights. Individual participants’ slopes and all individual data points are shown. A group-level t-test on betas yielded no consistent relationship (both ts ≤ 0.50, both ps = 0.81). Corresponding Bayes factors (both bf10s < 1/3) provide substantial evidence for no consistent relationship.

Still, this leaves it unclear whether variations in the strength of neural speech representations (i.e. the ‘stimulus classification’) can explain variations in the behavioural differences between participants. We therefore correlated the stimulus classification performance for all grid points with participants’ behavioural data, such as auditory and lip-reading performance, and the individual SNR value. We found no significant clusters (all ps > 0.11, two-sided cluster-based permutation statistics, uncorrected across the four tests), indicating that stimulus classification performance was not significantly correlated with behavioural performance across participants (Figure 3—figure supplement 2A). The corresponding Bayes factors confirm that in the large majority of brain regions, there is substantial evidence for no correlation (Figure 3—figure supplement 2B).

Discussion

Acoustic and visual speech are represented in distinct brain regions

Our results show that the cerebral representations of auditory and visual speech are mediated by both modality-specific and overlapping (potentially amodal) representations. While several parietal, temporal and frontal regions were engaged in the encoding of both acoustically and visually conveyed word identities (‘stimulus classification’), comprehension in both sensory modalities was driven mostly by distinct networks. Only the inferior frontal and anterior angular gyrus contained adjacent regions that contributed to both auditory and visual comprehension.

This multi-level organisation of auditory and visual speech is supported by several of our findings. First, we found a partial intersection of the sensory information, where significant word classification performance overlapped in bilateral postcentral regions, inferior temporal and frontal regions and the SMG. On the other hand, auditory and visual cortices represent strongly modality-specific word identities. Second, it is also supported by the observation that anterior angular and inferior frontal regions facilitate both auditory and visual comprehension, while distinct regions support modality-specific comprehension. In particular, our data suggest that middle occipital and posterior angular representations specifically drive auditory comprehension. In addition, superior frontal and temporal regions are engaged in lip reading, although we did not find strong evidence for a modality preference of these regions. None of these comprehension-relevant regions was strictly lateralised, in line with the notion that speech comprehension is largely a bilateral process (Kennedy-Higgins et al., 2020).

The inability to cross-classify auditory and visual speech from local activity further supports the conclusion that the nature of local representations of acoustic and visual speech is relatively distinct. It is important to note that cross-classification probes not only the spatial overlap of two representations but also asks whether the local spatio-temporal activity patterns encoding word identity in the two sensory modalities are the same. It could be that local activity encodes a given word based on acoustic or visual evidence but using distinct activity patterns. Representations could therefore spatially overlap without using the same ‘neural code’. Our results hence provide evidence that the activity patterns by which auditory and visual speech are encoded may be partly distinct, even within regions that represent both acoustically and visually mediated word information, such as the inferior frontal and anterior angular gyrus.

While we found the strongest word classification performance in sensory areas, significant classification also extended into central, frontal and parietal regions. This suggests that the stimulus-domain classifier used here may also capture processes potentially related to attention or motor preparation. While we cannot rule out that levels of attention differed between conditions, we ensured by experimental design that comprehension performance did not differ between modalities. In addition, the relevant target words were not placed at the end of the sentence, to prevent motor planning and preparation during their presentation (see Stimuli).

The encoding of visual speech

The segregation of comprehension-relevant auditory and visual representations provides a possible explanation for the finding that auditory or verbal skills and visual lip reading are uncorrelated in normal-hearing adults (Jeffers and Barley, 1980; Mohammed et al., 2006; Summerfield, 1992). Indeed, it has been suggested that individual differences in lip reading represent something other than normal variation in speech perceptual abilities (Summerfield, 1992). For example, lip reading skills are unrelated to reading abilities in the typical adult population (Arnold and Köpsel, 1996; Mohammed et al., 2006), although a relationship is sometimes found in deaf or dyslexic children (Arnold and Köpsel, 1996; de Gelder and Vroomen, 1998; Kyle et al., 2016).

Previous imaging studies suggested that silent lip reading engages similar auditory regions as engaged by acoustic speech (Bourguignon et al., 2020; Calvert, 1997; Calvert and Campbell, 2003; Capek et al., 2008; MacSweeney et al., 2000; Paulesu et al., 2003; Pekkola et al., 2005), implying a direct route for visual speech into the auditory pathways and an overlap of acoustic and visual speech representations in these regions (Bernstein and Liebenthal, 2014). Studies comparing semantic representations from different modalities also supported large modality-independent networks (Fairhall and Caramazza, 2013; Shinkareva et al., 2011; Simanova et al., 2014). Yet, most studies have focused on mapping activation strength rather than the encoding of word identity by cerebral speech representations. Hence, it could be that visual speech may activate many regions in an unspecific manner, without engaging specific semantic or lexical representations, maybe as a result of attentional engagement or feed-back (Balk et al., 2013; Ozker et al., 2018). Support for this interpretation comes from lip reading studies showing that auditory cortical areas are equally activated by visual words and pseudo-words (Calvert, 1997; Paulesu et al., 2003), and studies demonstrating cross-modal activations in early sensory regions also for simplistic stimuli (Ferraro et al., 2020; Ibrahim et al., 2016; Petro et al., 2017).

Our results suggest that visual speech comprehension is mediated by parietal and inferior frontal regions that likely contribute to both auditory and visual speech comprehension, but also engage superior temporal and superior frontal regions. Thereby our results support a route of visual speech into auditory cortical and temporal regions but provide no evidence for an overlap of speech representations in the temporal lobe that would facilitate both lip-reading and acoustic speech comprehension, in contrast to recent suggestions from a lesion-based approach (Hickok et al., 2018).

Two specific regions mediating lip-reading comprehension were the IFG and the anterior angular gyrus. Our results suggest that these facilitate both auditory and visual speech comprehension, in line with previous suggestions (Simanova et al., 2014). Previous work has also implicated these regions in the visual facilitation of auditory speech-in-noise perception (Giordano et al., 2017) and lip-reading itself (Bourguignon et al., 2020). Behavioural studies have shown that lip-reading drives the improvement of speech perception in noise (Macleod and Summerfield, 1987), hence suggesting that the representations of visual speech in these regions may be central for hearing in noisy environments. Interestingly, these regions resemble the left-lateralised dorsal pathway activated in deaf signers when seeing signed verbs (Emmorey et al., 2011). Our results cannot directly address whether these auditory and visual speech representations are the same as those that mediate the multisensory facilitation of speech comprehension in adverse environments (Bishop and Miller, 2009; Giordano et al., 2017). Future work needs to directly contrast the degree to which multisensory speech representations overlap locally with the ability of these regions to directly fuse this information.

Cross-modal activations in visual cortex

We also found that acoustic comprehension was related to occipital brain activity (cf. Figure 3). Previous work has shown that salient sounds activate visual cortices (Feng et al., 2014; McDonald et al., 2013), with top-down projections providing visual regions with semantic information, for example about object categories (Petro et al., 2017; Revina et al., 2018). The acoustic speech in the present study was presented in noise, and performing the task hence required attentional effort. Attention may therefore have automatically facilitated the entrance of top-down semantic information into occipital regions that, in a multisensory context, would encode the lip-movement trajectory, in order to maximise task performance (McDonald et al., 2013). The lack of significant cross-classification performance suggests that the nature of this top-down induced representation differs from that induced by direct lip-movement information.

Sub-optimal sensory representations contribute critically to behaviour

To understand which cerebral representations of sensory information guide behaviour, it is important to dissociate those that mainly correlate with the stimulus from those that encode sensory information and guide behavioural choice. At the single neuron level some studies have proposed that only those neurons encoding the specific stimulus optimally are driving behaviour (Britten et al., 1996; Pitkow et al., 2015; Purushothaman and Bradley, 2005; Tsunada et al., 2016), while others suggest that ‘plain’ sensory information and sensory information predictive of choice can be decoupled across neurons (Runyan et al., 2017). Theoretically, these different types of neural representations can be dissected by considering the intersection of brain activity predictive of stimulus and choice (Panzeri et al., 2017), that is, the neural representations that are informative about the sensory environment and are used to guide behaviour. While theoretically attractive, this intersection is difficult to quantify for high-dimensional data, in part as direct estimates of this intersection, for example based on information-theoretic approaches, are computationally costly (Pica et al., 2017). Hence, in the past most studies, also on speech, have focused on either studying sensory encoding (e.g. by classifying stimuli), or behaviourally predictive activity only (e.g. by classifying responses). However, the former type of cerebral representation may not guide behaviour at all, while the latter may also capture brain activity that drives perceptual errors due to intrinsic fluctuations in sensory pathways, the decision process, or even noise in the motor system (Grootswagers et al., 2018).

To directly quantify where auditory or visual speech is represented and where this representation is used to guide comprehension, we capitalised on the use of a stimulus-classifier to first pinpoint brain activity carrying relevant word-level information and to then test where the quality of the single-trial word representation is predictive of participants’ comprehension (Cichy et al., 2017; Grootswagers et al., 2018; Ritchie et al., 2015). This approach directly follows the idea to capture processes related to the encoding of external (stimulus-driven) information and to then ask whether these representations correlate over trials with the behavioural outcome or report. Although one has to be careful in interpreting this as causally driving behaviour, our results reveal that brain regions allowing for a sub-optimal read-out of the actual stimulus are predictive of the perceptual outcome, whereas those areas allowing the best read-out do not necessarily predict behaviour. This dissociation is emerging in several recent studies on the neural basis underlying perception (Bouton et al., 2018; Grootswagers et al., 2018; Hasson et al., 2007; Keitel et al., 2018). Importantly, it suggests that networks mediating speech comprehension can neither be understood by mapping speech representations during passive perception nor during task performance, if the analysis itself is not geared towards directly revealing the perception-relevant representations.

On a technical level, it is important to keep in mind that the insights derived from any classification analysis are limited by the quality of the overall classification performance. Classification performance was highly significant and reached about 10% above the respective chance level, a number that is in accordance with other neuroimaging studies on auditory pathways (Bednar et al., 2017; Correia et al., 2015). Yet, more refined classification techniques, or data obtained using significantly larger stimulus sets and more repetitions of individual target words may be able to provide even more refined insights. In addition, by design of our experiment (four response options) and data analysis, the neurobehavioral analysis was primarily driven by trials in which the respective brain activity encoded the sensory stimulus correctly. We cannot specifically link the incorrect encoding of a stimulus with behaviour. This is in contrast to studies using only two stimulus or response options, where evidence for one option directly provides evidence against the other (Frühholz et al., 2016; Petro et al., 2013).

One factor that may shape the behavioural relevance of local sensory representations is the specific task imposed (Hickok and Poeppel, 2007). In studies showing the perceptual relevance of optimally encoding neurons, the tasks were mostly dependent on low-level features (Pitkow et al., 2015; Tsunada et al., 2016), while studies pointing to a behavioural relevance of high-level regions relied on high-level information such as semantics or visual object categories (Grootswagers et al., 2018; Keitel et al., 2018). One prediction from our results is therefore that if the nature of the task was changed from speech comprehension to an acoustic task, the perceptual relevance of word representations would shift from left anterior regions to strongly word-encoding temporal and supramarginal regions. Similarly, if the task concerned detecting basic kinematic features of the visual lip trajectory, activity within early visual cortices tracking the stimulus dynamics should be more predictive of behavioural performance (Di Russo et al., 2007; Keitel et al., 2019; Tabarelli et al., 2020). This suggests that a discussion of the relevant networks underlying speech perception should always be task-focused.

Conclusion

These results suggest that cerebral representations of acoustic and visual speech might be more modality-specific than often assumed and provide a neural explanation for why acoustic speech comprehension is a poor predictor of lip-reading skills. Our results also suggest that those cerebral speech representations that directly drive comprehension are largely distinct from those best representing the physical stimulus, strengthening the notion that neuroimaging studies need to more specifically quantify the cerebral mechanisms driving single-trial behaviour.

Materials and methods

Part of the dataset analysed in the present study has been used in a previous publication (Keitel et al., 2018). The data analysis performed here is entirely different from the previous work and includes unpublished data.

Participants and data acquisition

Twenty healthy, native volunteers participated in this study (nine female, age 23.6 ± 5.8 y [M ± SD]). The sample size was set based on previous recommendations (Bieniek et al., 2016; Poldrack et al., 2017; Simmons et al., 2011). MEG data of two participants had to be excluded due to excessive artefacts. Analysis of MEG data therefore included 18 participants (seven female), whereas the analysis of behavioural data included 20 participants. An exception to this is the neurobehavioral analysis in the visual condition, where three participants performed at ceiling and had to be excluded (resulting in n = 15 participants in Figure 3B). All participants were right-handed (Edinburgh Handedness Inventory; Oldfield, 1971), had normal hearing (Quick Hearing Check; Koike et al., 1994), and normal or corrected-to-normal vision. Participants had no self-reported history of neurological or language disorders. All participants provided written informed consent prior to testing and received monetary compensation of £10/h. The experiment was approved by the ethics committee of the College of Science and Engineering, University of Glasgow (approval number 300140078), and conducted in compliance with the Declaration of Helsinki.

MEG was recorded with a 248-magnetometer, whole-head MEG system (MAGNES 3600 WH, 4-D Neuroimaging) at a sampling rate of 1 kHz. Head positions were measured at the beginning and end of each run, using five coils placed on the participant’s head. Coil positions were co-digitised with the head-shape (FASTRAK, Polhemus Inc, VT, USA). Participants sat upright and fixated a fixation point projected centrally on a screen. Visual stimuli were displayed with a DLP projector at 25 frames/second, a resolution of 1280 × 720 pixels, and covered a visual field of 25 × 19 degrees. Sounds were transmitted binaurally through plastic earpieces and 370 cm long plastic tubes connected to a sound pressure transducer and were presented stereophonically at a sampling rate of 22,050 Hz. Stimulus presentation was controlled with Psychophysics toolbox (Brainard, 1997) for MATLAB (The MathWorks, Inc) on a Linux PC.

Stimuli

The experiment featured three conditions: auditory only (A), visual only (V), and a third condition in which the same stimulus material was presented audiovisually (AV). This condition could not be used for the main analyses as participants performed near ceiling level in the behavioural task (correct trials: M = 96.5%, SD = 3.4%; see Figure 2—figure supplement 2A for results). The stimulus material consisted of 180 sentences, based on a set of 18 target words derived from two categories (nine numbers and nine adjectives), each repeated 10 times in a different sentence. Sentences were spoken by a trained, male, native British actor. Sentences were recorded with a high-performance camcorder (Sony PMW-EX1) and external microphone. The speaker was instructed to speak clearly and naturally. Each sentence had the same linguistic structure (Keitel et al., 2018). An example is: ‘Did you notice (filler phrase), on Sunday night (time phrase) Graham (name) offered (verb) ten (number) fantastic (adjective) books (noun)’. In total, 18 possible names, verbs, numbers, adjectives, and nouns were each repeated ten times. Sentence elements were re-combined within a set of 180 sentences. As a result, sentences made sense, but no element could be semantically predicted from the previous material. To measure comprehension performance, a target word was selected that was either the adjective in one set of sentences (‘fantastic’ in the above example) or the number in the other set (for example, ‘thirty-two’). These were always the second or third last word in a sentence, ensuring that the cerebral processes encoding these words were independent from the behavioural (motor) response. All adjective target words had a positive valence (Scott et al., 2019; see table in Supplementary file 1 for all possible target words). The duration of sentences ranged from 4.2 s to 6.5 s (5.4 ± 0.4 s [M ± SD]).
Noise/video onset and offset was approximately 1 s before and after the speech, resulting in stimulus lengths of 6.4 s to 8.2 s (Figure 1). The durations of target words ranged from 419 ms to 1,038 ms (679 ± 120 ms [M ± SD]). After the offset of the target words, the stimulus continued for 1.48 s to 2.81 s (1.98 ± 0.31 s [M ± SD]) before the end of the sentence.

The acoustic speech was embedded in noise to match performance between auditory and visual conditions. The noise consisted of ecologically valid, environmental sounds (traffic, car horns, talking), combined into a uniform mixture of 50 different background noises. The individual noise level for each participant was determined with a one-up-three-down (3D1U) staircase procedure that targets the 79.4% probability correct level (Karmali et al., 2016). For the staircase procedure, only the 18 possible target words (i.e. adjectives and numbers) were used instead of whole sentences. Participants were presented with a single target word embedded in noise and had to choose between two alternatives. Note that due to the necessary differences between the staircase procedure (single words and two-alternative forced-choice) and the behavioural experiment (sentences and four-alternative forced-choice), the performance in the behavioural task was lower than 79.4%. The signal-to-noise ratio across participants ranged from −7.75 dB to −3.97 dB (−5.96 ± 1.06 dB [M ± SD]; see Figure 1B).
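
The 79.4% target follows from the staircase rule itself: at convergence, a step down (three consecutive correct responses) must be as likely as a step up, so p³ = 0.5 and p = 0.5^(1/3) ≈ 0.794. The following is a minimal sketch of that reasoning and of a single staircase update. The function names and the 1 dB step size are illustrative assumptions and are not taken from the study.

```python
def staircase_target(n_down: int) -> float:
    """Probability correct at which an n-down-one-up staircase converges."""
    # At convergence, P(n_down correct in a row) = 0.5, so p ** n_down = 0.5.
    return 0.5 ** (1.0 / n_down)


def update_snr(snr_db, correct, streak, n_down=3, step_db=1.0):
    """One staircase step; returns (new_snr_db, new_streak).

    step_db is an illustrative assumption, not a value from the study.
    """
    if not correct:
        return snr_db + step_db, 0      # one error: raise SNR (easier)
    streak += 1
    if streak == n_down:
        return snr_db - step_db, 0      # n_down correct in a row: lower SNR (harder)
    return snr_db, streak               # keep the level, remember the streak


target = staircase_target(3)
```

Here `target` evaluates to roughly 0.794, the level quoted in the text.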

Experimental design

The 180 sentences were each presented in three conditions (A, V, AV), each consisting of four blocks with 45 sentences each. In each block, participants either reported the comprehended adjective or number, resulting in two ‘adjective blocks’ and two ‘number blocks’. The order of sentences and blocks was randomised for each participant. The first trial of each block was a ‘dummy’ trial that was discarded for subsequent analysis; this trial was repeated at the end of the block.

During the presentation of the sentence, participants fixated either a dot (auditory condition) or a small cross on the speaker’s mouth (see Figure 1 for depiction of trial structure). After each sentence, participants were presented with four target words (either adjectives or written numbers) on the screen and had to indicate which one they perceived by pressing one of four buttons on a button box. After 2 s, the next trial started automatically. Each block lasted approximately 10 min. The two separate sessions were completed within one week.

MEG pre-processing

Pre-processing of MEG data was carried out in MATLAB (The MathWorks, Inc) using the Fieldtrip toolbox (Oostenveld et al., 2011). All experimental blocks were pre-processed separately. Single trials were extracted from continuous data starting 2 s before sound/video onset and until 10 s after onset. MEG data were denoised using a reference signal. Known faulty channels (N = 7) were removed before further pre-processing. Trials with SQUID jumps (on average 3.86% of trials) were detected and removed using Fieldtrip procedures with a cut-off z-value of 30. Before further artifact rejection, data were filtered between 0.2 and 150 Hz (fourth order Butterworth filters, forward and reverse) and down-sampled to 300 Hz. Data were visually inspected to find noisy channels (4.95 ± 5.74 on average across blocks and participants) and trials (0.60 ± 1.24 on average across blocks and participants). There was no indication for a statistical difference between the number of rejected channels or trials between conditions (two-sided t-tests; p>0.48 for channels, p>0.40 for trials). Finally, heart and eye movement artifacts were removed by performing an independent component analysis with 30 principal components (2.5 components removed on average). Data were further down-sampled to 150 Hz and bandpass-filtered between 0.8 and 30 Hz (fourth order Butterworth filters, forward and reverse).

Source reconstruction

Source reconstruction was performed using Fieldtrip, SPM8, and the Freesurfer toolbox. We acquired T1-weighted structural magnetic resonance images (MRIs) for each participant. These were co-registered to the MEG coordinate system using a semi-automatic procedure (Gross et al., 2013; Keitel et al., 2017). MRIs were then segmented and linearly normalised to a template brain (MNI space). A forward solution was computed using a single-shell model (Nolte, 2003). We projected sensor-level timeseries into source space using a frequency-specific linearly constrained minimum variance (LCMV) beamformer (Van Veen et al., 1997) with a regularisation parameter of 7% and optimal dipole orientation (singular value decomposition method). Covariance matrices for source projection were based on the whole length of trials (Brookes et al., 2008). Grid points had a spacing of 6 mm, resulting in 12,337 points covering the whole brain. For subsequent analyses, we selected grid points that corresponded to cortical regions only (parcellated using the AAL atlas; Tzourio-Mazoyer et al., 2002). This resulted in 6490 grid points in total.

Neural time series were spatially smoothed (Gross et al., 2013) and normalised in source space. For this, the band-pass filtered time series for the whole trial (i.e. the whole sentence) were projected into source space and smoothed using SPM8 routines with a Full-Width Half Max (FWHM) value of 3 mm. The time series for each grid point and trial was then z-scored.

Classification analysis

We used multi-variate single-trial classification to localise cerebral representations of the target words in source activity (Grootswagers et al., 2017; Guggenmos et al., 2018). Each target word was presented in ten different trials per condition. We extracted the 500 ms of activity following the onset of each target word and re-binned the source activity at 20 ms resolution. This choice of the analysis time window was made based on the typical duration of target words (M = 679 ms length, see Stimuli). Because the words following the target word differed in each sentence, choosing a longer window would have contaminated the specific classification of target word identity. We therefore settled on a 500 ms window, which has been shown to be sufficient for word decoding (Chan et al., 2011) and does not include the beginning of the following word in most sentences (94%). Importantly, this analysis window did not capture post-sentence or response periods. Classification was performed on spatial searchlights of 1.2 cm radius. The typical searchlight contained 31 neighbours (median value), with 95% of searchlights containing 17 to 33 grid points. The (leave-one-trial-out) classifier computed, for a given trial, the Pearson correlation of the spatio-temporal searchlight activity in this test-trial with the activities for the same word in all other trials (within-target distances), and with the activities of the three alternative words in all trials (between-target distances). That is, each trial was classified within the sub-set of words that was available to the participant as potential behavioural choices (see Figure 1—figure supplement 2 for illustration). We then averaged correlations within the four candidate words and decoded the target trial as the word identity with the strongest average correlation (that is, smallest classifier distance).
This classification measure is comparable to previous studies probing how well speech can be discriminated based on patterns of dynamic brain activity (Luo and Poeppel, 2007; Rimmele et al., 2015). Classification performance was averaged across blocks with numbers and adjectives as task-relevant words. For cross-condition classification (Figure 2—figure supplement 3), we classified the single-trial activity from the auditory (visual) condition against all trials with the same word alternatives from the other condition, or from the audiovisual condition.
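
The leave-one-trial-out, correlation-based classification described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: each trial is assumed to be a flattened spatio-temporal searchlight pattern, and the function names, the toy word labels and the synthetic data are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def classify_trial(test, true_word, test_idx, trials_by_word):
    """Return (decoded_word, mean correlation per candidate word)."""
    mean_r = {}
    for word, trials in trials_by_word.items():
        # Leave the test trial itself out of its own word's training set.
        others = [t for i, t in enumerate(trials)
                  if not (word == true_word and i == test_idx)]
        mean_r[word] = sum(pearson(test, t) for t in others) / len(others)
    # Strongest average correlation wins (smallest classifier distance).
    return max(mean_r, key=mean_r.get), mean_r


def loo_accuracy(trials_by_word):
    """Fraction of trials decoded as their true word."""
    hits = total = 0
    for word, trials in trials_by_word.items():
        for i, trial in enumerate(trials):
            decoded, _ = classify_trial(trial, word, i, trials_by_word)
            hits += decoded == word
            total += 1
    return hits / total


# Synthetic demo: four candidate words, ten trials each, 50 features per trial.
rng = random.Random(0)
words = ["ten", "four", "nine", "six"]  # hypothetical response set
templates = {w: [rng.gauss(0, 1) for _ in range(50)] for w in words}
toy = {w: [[v + rng.gauss(0, 0.5) for v in templates[w]] for _ in range(10)]
       for w in words}
accuracy = loo_accuracy(toy)
```

With four candidate words the chance level is 25%. The synthetic patterns are deliberately easy to separate, so the demo accuracy says nothing about the performance obtained on real MEG data.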

Selection of parameters and classifier procedures

We initially tested a number of different classifiers, including linear-discriminant and diagonal-linear classifiers, and then selected a correlation-based nearest-neighbour classifier as this performed slightly better than the others (although we note that the difference in peak classification performance was only in the range of 2–3% between different classifiers). We focussed on linear classifiers here because these have been shown to often perform as well as more complex non-linear classifiers, while also offering insights that are more readily interpretable (Haxby et al., 2014; Kamitani and Tong, 2005; Ritchie et al., 2019).

To assess the impact of the temporal binning of MEG activity, we probed classification performance based on bins of 3.3, 20, 40 and 60 ms length. Classification performance dropped slightly when sampling the data at a resolution lower than 20 ms, particularly for auditory classification (for 3.3, 20, 40 and 60 ms bins, the mean performance of the 10% grid points with the highest values in the benchmark 20 ms classification was: auditory, 27.19 ± 0.48%, 26.81 ± 0.86%, 26.54 ± 1.00% and 25.92 ± 0.85%; visual, 28.71 ± 1.55%, 28.68 ± 1.73%, 28.54 ± 1.68% and 27.85 ± 2.04% [M ± SD]).
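
To make the binning concrete: at the 150 Hz sampling rate stated under MEG pre-processing, the 500 ms analysis window holds 75 samples, so 20 ms bins correspond to three samples each and yield 25 bins. The sketch below assumes that samples within a bin are averaged, which the text does not specify; `rebin` is a hypothetical helper.

```python
def rebin(samples, fs_hz, bin_ms):
    """Average consecutive samples into bins of bin_ms; drop a trailing partial bin."""
    per_bin = max(1, round(fs_hz * bin_ms / 1000.0))
    n_bins = len(samples) // per_bin
    return [sum(samples[b * per_bin:(b + 1) * per_bin]) / per_bin
            for b in range(n_bins)]


window = list(range(75))            # stand-in for 500 ms of data at 150 Hz
bins_20ms = rebin(window, 150, 20)  # 25 bins of 3 samples
bins_40ms = rebin(window, 150, 40)  # 12 bins of 6 samples
bins_60ms = rebin(window, 150, 60)  # 8 bins of 9 samples
```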

We also probed the influence of the spatial searchlight by (i) including each neighbouring spatial grid point in the searchlight, (ii) averaging across grid points, and (iii) not using a searchlight at all. Ignoring the spatial pattern by averaging grid points led to a small drop in classification performance (individual vs average grid points: auditory, 27.19 ± 0.48 vs 26.72 ± 0.67; visual, 28.71 ± 1.55 vs 27.71 ± 1.25 [M ± SD]). Performance also dropped slightly when no searchlight was included (auditory, 26.77 ± 1.93; visual, 27.86 ± 2.50 [M ± SD]).

For the main analysis, we therefore opted for a classifier based on the MEG source data represented as spatial searchlight including each grid point within a 1.2 cm radius, and binned at 20 ms resolution.
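
As a rough illustration of the searchlight geometry, the sketch below enumerates grid points within a 1.2 cm radius on a regular 6 mm grid. It assumes a cubic grid indexed by integer coordinates and an inclusive radius; both are assumptions, as are the function names. Under these assumptions a full interior searchlight holds 33 points, which agrees with the upper end of the 17 to 33 range reported under Classification analysis; searchlights at the edge of the cortical mask hold fewer.

```python
from itertools import product


def searchlight_offsets(spacing_mm=6.0, radius_mm=12.0):
    """Integer grid offsets whose Euclidean distance is within the radius."""
    reach = int(radius_mm // spacing_mm)
    r2 = radius_mm ** 2
    return [(i, j, k)
            for i, j, k in product(range(-reach, reach + 1), repeat=3)
            if (i * i + j * j + k * k) * spacing_mm ** 2 <= r2]


def searchlight(center, grid_points, offsets):
    """Grid points of the mask (a set of integer triplets) inside the searchlight."""
    ci, cj, ck = center
    return [(ci + i, cj + j, ck + k) for i, j, k in offsets
            if (ci + i, cj + j, ck + k) in grid_points]


offsets = searchlight_offsets()
toy_mask = set(product(range(5), repeat=3))            # toy 5 x 5 x 5 mask
interior = searchlight((2, 2, 2), toy_mask, offsets)   # fully inside the mask
corner = searchlight((0, 0, 0), toy_mask, offsets)     # truncated at the edge
```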

Quantifying the behavioural relevance of speech representations

To quantify the degree to which the classifier evidence obtained from local speech representations in favour of a specific word identity is predictive of participants’ comprehension, we extracted an index of how well the classifier separated the correct word identity from the three false alternatives: the distance of the single trial classifier evidence to a decision bound (Cichy et al., 2017; Grootswagers et al., 2018; Ritchie et al., 2015). This representational distance was defined as the average correlation with trials of the same (within-target distances) word identity minus the mean of the correlation with the three alternatives (between-target distances; see Figure 1—figure supplement 2). If a local cerebral representation allows a clear and robust classification of a specific word identity, this representational distance would be large, while if a representation allows only for poor classification, or mis-classifies a trial, this distance will be small or negative. We then quantified the statistical relation between participants’ performance (accuracy) and these single-trial representational distances (Cichy et al., 2017; Grootswagers et al., 2018; Panzeri et al., 2017; Pica et al., 2017; Ritchie et al., 2015). This analysis was based on a regularised logistic regression (Parra et al., 2005), which was computed across all trials per participant. To avoid biasing, the regression model was computed across randomly selected subsets of trials with equal numbers of correct and wrong responses, averaging betas across 50 randomly selected trials. The resulting beta values were averaged across blocks with numbers and adjectives as targets and were entered into a group-level analysis. Given the design of the task (four response options, around 70% correct performance), this analysis capitalises on the relation between correctly encoded words (positive representational distance) and their relation to performance.
Conversely, it is not directly able to capture how a wrongly encoded word identity relates to performance.
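
The two steps above (computing a representational distance per trial, then regressing accuracy on it over balanced subsets) can be sketched as follows. Several points are assumptions made for illustration: the gradient-descent L2-penalised logistic regression is a simple stand-in for the regularised regression of Parra et al. (2005), the penalty and learning rate are arbitrary, betas are averaged over 50 random balanced subsets, and all names and toy data are hypothetical.

```python
import math
import random


def representational_distance(mean_r, true_word):
    """Within-target correlation minus the mean between-target correlation."""
    between = [r for w, r in mean_r.items() if w != true_word]
    return mean_r[true_word] - sum(between) / len(between)


def fit_logistic(x, y, l2=0.1, lr=0.5, n_iter=300):
    """Single-predictor L2-penalised logistic regression; returns the slope."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += p - yi
            g1 += (p - yi) * xi
        b0 -= lr * g0 / n
        b1 -= lr * (g1 / n + l2 * b1)   # the penalty applies to the slope only
    return b1


def balanced_beta(distances, correct, n_subsets=50, seed=0):
    """Mean slope over random subsets with equal numbers of hits and misses."""
    rng = random.Random(seed)
    hits = [d for d, c in zip(distances, correct) if c]
    misses = [d for d, c in zip(distances, correct) if not c]
    k = min(len(hits), len(misses))
    betas = []
    for _ in range(n_subsets):
        x = rng.sample(hits, k) + rng.sample(misses, k)
        y = [1] * k + [0] * k
        betas.append(fit_logistic(x, y))
    return sum(betas) / len(betas)


# Toy data: about 70% correct, with larger distances on correct trials.
rng = random.Random(1)
toy_correct = [True] * 70 + [False] * 30
toy_distance = [rng.gauss(0.2 if c else 0.0, 0.1) for c in toy_correct]
beta = balanced_beta(toy_distance, toy_correct)
example_distance = representational_distance(
    {"ten": 0.5, "four": 0.1, "nine": 0.2, "six": 0.0}, "ten")
```

A positive `beta` indicates that larger representational distances go along with correct responses, which is the relation the neuro-behavioural analysis tests at each grid point.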

Quantifying the role of phonological and semantic features in perception

For each pair of words we computed their phonological distance using the Phonological Corpus Tools (V1.4.0) based on the phonetic string similarity (‘phonological edit distance’) derived from the transcription tier, using the Irvine Phonotactic Online Dictionary (Vaden et al., 2009). We also computed pairwise semantic distances using the fastText vector representation of English words trained on Common Crawl and Wikipedia obtained online (file cc.en.300.vec) (Grave et al., 2018). The individual word vectors (300 dimensions) were length-normalised and cosine distances were computed. For each participant, we obtained a behavioural representational dissimilarity matrix (RDM) as the pair-wise behavioural confusion matrix from their behavioural data. We then implemented a representational similarity analysis (RSA) (Kriegeskorte et al., 2008) between phonological (semantic) representations and participants’ performance. Specifically, behavioural and semantic (phonetic) RDMs were compared using Spearman’s rank correlation. The resulting correlations were z-scored and averaged across adjectives and numbers (see Figure 1—figure supplement 1).
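
A compact sketch of the RDM comparison follows: cosine distances between word vectors form a model RDM, and its upper triangle is compared with a behavioural RDM using Spearman's rank correlation. The three-dimensional vectors and the behavioural values are made-up toy numbers, not fastText embeddings or study data, and the helper names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return out


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def upper_triangle(matrix):
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


# Toy word vectors (hypothetical) and a toy behavioural RDM.
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.2, 1.0]]
model_rdm = [[cosine_distance(u, v) for v in vectors] for u in vectors]
behaviour_rdm = [[0.0, 0.1, 0.8, 0.9],
                 [0.1, 0.0, 0.7, 0.9],
                 [0.8, 0.7, 0.0, 0.6],
                 [0.9, 0.9, 0.6, 0.0]]
rho = spearman(upper_triangle(model_rdm), upper_triangle(behaviour_rdm))
```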

Statistical analyses

To test the overall stimulus classification performance, we averaged the performance per grid point across participants and compared this group-averaged value to a group-average permutation distribution obtained from 3000 within-subject permutations derived with random trial labels. Cluster-based permutation was used to correct for multiple comparisons (Maris and Oostenveld, 2007). Significant clusters were identified based on a first-level significance derived from the 99.95th percentile of the permuted distribution (family-wise error [FWE] of p=0.001), using the summed statistics (Tsum) across grid points within a cluster, and by requiring a minimal cluster size of 10 grid points. The resulting clusters were considered if they reached a p-value smaller than 0.05.
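
The logic of the cluster-based permutation test can be illustrated with a simplified one-dimensional sketch. It departs from the actual analysis in several labelled ways: neighbourhood is defined along a 1-D chain rather than the 3-D source grid, only 500 synthetic permutations are used instead of 3000, and the p-value uses a common (count + 1)/(n + 1) convention. All function names and data are hypothetical.

```python
import random


def percentile(values, q):
    """Simple empirical percentile (no interpolation)."""
    s = sorted(values)
    return s[min(len(s) - 1, int(q / 100.0 * len(s)))]


def clusters_1d(stat, threshold, min_size):
    """Summed statistic of each run of adjacent supra-threshold points."""
    sums, run = [], []
    for v in list(stat) + [float("-inf")]:   # sentinel closes the last run
        if v > threshold:
            run.append(v)
        else:
            if len(run) >= min_size:
                sums.append(sum(run))
            run = []
    return sums


def cluster_p_values(observed, permuted, q=99.95, min_size=10):
    """One p-value per observed cluster, from the max-cluster null distribution."""
    pooled = [v for perm in permuted for v in perm]
    thr = percentile(pooled, q)              # first-level threshold
    null_max = [max(clusters_1d(p, thr, min_size), default=0.0)
                for p in permuted]
    obs = clusters_1d(observed, thr, min_size)
    return [(1 + sum(m >= c for m in null_max)) / (1 + len(null_max))
            for c in obs]


# Synthetic demo: classification performance (%) around a 25% chance level.
rng = random.Random(2)
n_points = 200
permuted = [[rng.gauss(25.0, 1.0) for _ in range(n_points)] for _ in range(500)]
observed = [rng.gauss(25.0, 1.0) for _ in range(n_points)]
for i in range(50, 70):
    observed[i] += 10.0                      # a 20-point cluster well above chance
p_values = cluster_p_values(observed, permuted)
```

Taking the maximum cluster statistic per permutation is what controls the family-wise error across grid points: an observed cluster is only significant if it exceeds what the largest chance cluster typically reaches.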

For the neuro-behavioural analyses, the regression betas obtained from the logistic regression were transformed into group-level t-values. These were compared with a surrogate distribution of t-values obtained from 3000 within-subject permutations using shuffled trial labels and using cluster-based permutations as above. The first-level significance threshold (at p<0.05) was determined per condition based on the included sample size (t-value of t = 2.1 for the 18 participants in the auditory condition and t = 2.2 for 15 participants in the visual condition), and the resulting clusters were considered significant if they reached a p-value smaller than 0.05.

Resulting clusters were tested for lateralisation (Liégeois et al., 2002; Park and Kayser, 2019). For this, we extracted the participant-specific classification performance (or regression betas, respectively) for each cluster and for the corresponding contralateral grid points. These values were averaged within each hemisphere and the between-hemispheric difference was computed using a group-level, two-sided t-test. Resulting p-values were corrected for multiple comparisons by controlling the FDR at p≤0.05 (Benjamini and Hochberg, 1995). We only use the term ‘lateralised’ if the between-hemispheric difference is statistically significant.
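The FDR step can be illustrated with a short sketch of the Benjamini-Hochberg step-up procedure. The function name `benjamini_hochberg` is hypothetical, and the input is assumed to be the list of p-values from the per-cluster hemispheric t-tests.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return True for each hypothesis rejected while controlling the FDR at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected
```

Because the procedure is step-up, a p-value that fails its own rank threshold is still rejected if a larger p-value further down the sorted list passes its threshold.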

To determine whether individual local effects (e.g. stimulus classification or behavioural prediction) were specific to either condition, we implemented a direct contrast between conditions. For each grid point, we computed a group-level t-test. We then subjected these to the same full-brain cluster-based permutation approach as described above. In addition, we converted the group-level t-values to a JZS Bayes factor using a default scale factor of 0.707 (Rouder et al., 2009). We then quantified the number of grid points per region of interest that exhibited a specific level of evidence in favour of the null hypothesis of no effect versus the alternative hypothesis (H0 vs H1) (Jeffreys, 1998). Using previous conventions (Wagenmakers et al., 2011), the Bayes factors were interpreted as showing evidence for H1 if they exceeded a value of 3, and evidence for H0 if they were below ⅓, with the intermediate range yielding inconclusive results. We also calculated Bayes factors from Pearson correlation coefficients (for a control analysis between classification performance and behavioural data), using the same conventions (Wetzels and Wagenmakers, 2012).
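The conversion from a t-value to a JZS Bayes factor can be sketched as below, following the integral formulation in Rouder et al. (2009). Several points are assumptions rather than details from the text: the sketch treats the contrast as a one-sample (paired) design with n participants, it evaluates the integral with a simple trapezoid rule on a logarithmic grid, and the names `jzs_bf10` and `interpret` are hypothetical.

```python
import math

def jzs_bf10(t, n, r=0.707, steps=6000, lo=-30.0, hi=30.0):
    """JZS Bayes factor (H1 over H0) for a one-sample t-value with n observations."""
    nu = n - 1  # degrees of freedom

    def integrand(u):
        g = math.exp(u)  # integrate over u = log(g)
        a = 1.0 + n * g
        likelihood = a ** -0.5 * (1.0 + t * t / (a * nu)) ** (-(nu + 1) / 2.0)
        # Inverse-gamma(1/2, r^2/2) prior on g, equivalent to a Cauchy(0, r) prior on effect size.
        prior = r * (2.0 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-r * r / (2.0 * g))
        return likelihood * prior * g  # the extra factor g comes from dg = g du

    h = (hi - lo) / steps
    total = 0.5 * (integrand(lo) + integrand(hi))
    for i in range(1, steps):
        total += integrand(lo + i * h)
    marginal_h1 = total * h
    marginal_h0 = (1.0 + t * t / nu) ** (-(nu + 1) / 2.0)
    return marginal_h1 / marginal_h0

def interpret(bf10):
    """Apply the conventions used in the text: above 3 favours H1, below 1/3 favours H0."""
    if bf10 > 3.0:
        return "H1"
    if bf10 < 1.0 / 3.0:
        return "H0"
    return "inconclusive"
```

Applied per grid point, `interpret(jzs_bf10(t, n))` yields the three evidence categories that are then counted within each region of interest.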

To investigate the relationship between stimulus classification and neuro-behavioural results, we performed a robust linear regression within each participant across all grid points. The participant-specific beta values were then tested against zero using a two-sided t-test (Keitel et al., 2017).

References

  15. The processing of audio-visual speech: empirical and neural bases
    R Campbell (2008)
    Philosophical Transactions of the Royal Society B: Biological Sciences 363:1001–1010.
    https://doi.org/10.1098/rstb.2007.2155
  47. Speechreading (Lipreading)
    J Jeffers, M Barley (1980)
    Charles C Thomas Publisher.
  48. The Theory of Probability
    H Jeffreys (1998)
    OUP Oxford Press.
  80. Contextual modulation of primary visual cortex by auditory signals
    LS Petro, AT Paton, L Muckli (2017)
    Philosophical Transactions of the Royal Society B: Biological Sciences 372:20160104.
    https://doi.org/10.1098/rstb.2016.0104
  81. Quantifying how much sensory information in a neural code is relevant for behavior
    G Pica, E Piasini, H Safaai, C Runyan, C Harvey, M Diamond, S Panzeri (2017)
    Advances in Neural Information Processing Systems.
  100. Visual perception of phonetic gestures
    Q Summerfield (1991)
    Paper presented at Modularity and the Motor Theory of Speech Perception: A Conference to Honor Alvin M. Liberman.
  101. Lipreading and audiovisual Speech-Perception
    Q Summerfield (1992)
    Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 335:71–78.
    https://doi.org/10.1098/rstb.1992.0009
  106. Irvine Phonotactic Online Dictionary, version 2.0
    KI Vaden, H Halpin, GS Hickok (2009)
    Irvine Phonotactic Online Dictionary.

Decision letter

  1. Tobias Reichenbach
    Reviewing Editor; Imperial College London, United Kingdom
  2. Barbara G Shinn-Cunningham
    Senior Editor; Carnegie Mellon University, United States
  3. Tobias Reichenbach
    Reviewer; Imperial College London, United Kingdom
  4. Matthew H Davis
    Reviewer; University of Cambridge, United Kingdom

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

We can understand speech not only through hearing the sound, but also through reading from somebody's lips. In this study, Keitel et al. show that the identity of words is encoded by similar neural networks, whether they are presented through auditory or through visual signals. However, the comprehension of speech in the two modalities involves largely different brain areas, suggesting that the neural mechanisms for auditory and for visual speech comprehension differ more than previously believed.

Decision letter after peer review:

Thank you for submitting your work entitled "Largely distinct networks mediate perceptually-relevant auditory and visual speech representations" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Tobias Reichenbach as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by a Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Matthew H Davis (Reviewer #2); John Plass (Reviewer #3).

Our decision has been reached after consultation between the reviewers. The individual reviews below and the discussion have identified significant revisions and additional analysis that are required for the manuscript to be published in eLife. Because this additional work would likely take more than the two months that we allow for revisions, we regret to inform you we must formally decline your manuscript at this time.

However, we would consider a new submission in which the concerns raised below have been addressed. If you choose to submit a new version of the manuscript to eLife, we will make every effort for the new manuscript to be assessed by the same reviewers and Reviewing Editor.

Reviewer #1:

The authors investigate audiovisual speech processing using MEG. They present participants with speech in noise, as well as with videos of talking faces. In both conditions, subjects understand about 70% of the speech. The authors then investigate which brain areas decode word identity, as well as which brain areas predict the subject's actual comprehension. They find largely distinct areas for encoding of words as well as for predicting subject performance. Moreover, for both aspects of speech processing, the brain areas differ largely between the auditory and the visual stimulus presentation.

The paper is well written, and the obtained results shed new light on audio-visual speech processing. In particular, they show a large degree of modality-specific processing, as well as a dissociation between the brain areas for speech representation and for encoding comprehension. I am therefore in favour of publication. But I have two major comments that I would like the authors to address in a revised version.

1) I don't see the point of presenting the supplementary Figure 1. Since the behavioural results are at ceiling, only 8 subjects (those whose performance is away from the ceiling) are included in the MEG analysis. But this number of subjects is too small to draw conclusive results. The authors account for this to a degree by labelling the results as preliminary. But since the results cannot be considered conclusive, I think they should either be left out or modified to be conclusive.

2) The correlation between stimulus classification and the behavioural performance is only done for the visual condition (subsection “Strong sensory representations do not necessarily predict behaviour”). The authors state that this correlation can't be performed in the auditory condition because the comprehension scores there were 70%. But Figure 1B shows significant subject-to-subject variability in speech comprehension around 70%. While the variation is less than for lip reading, I don't see a reason why this performance cannot be related to the stimulus classification as well. Please add that.

Reviewer #2:

This paper describes an MEG study in which neural responses to unimodal auditory and visual (lip-read) sentences are analysed to determine the spatial locations at which brain responses provide information to distinguish between words, whether and how these responses are associated with correct perceptual identification of words, and the degree to which common neural representations in overlapping brain regions contribute to perception of auditory and visual speech.

Results show largely distinct networks of auditory, visual brain areas that convey information on word identity in heard and lip-read speech and that contribute to perception (inferred by increased information in neural representations for correctly identified words). There was some, limited neural overlap seen in higher-order language areas (such as inferior frontal gyrus, and temporal pole). However, attempts at cross-decoding – i.e. testing for common neural representations between words that were identified in visual and auditory speech – were non-significant. This is despite significant cross-decoding of both auditory and visual speech using audio-visual presentation.

Results are interpreted as showing separate, modality-specific neural representations of auditory and visual word identities, and taken to explain the independence (i.e. non-correlation) between individual differences in measures of auditory and visual speech perception abilities.

This paper addresses an important and interesting topic. Individual differences in auditory and visual speech perception are well established, and the lack of correlation between these abilities in the population appears to be a well-replicated, but currently under-explained, finding. Indeed, this basic observation goes against the dominant thrust of research on audio-visual speech perception which is largely concerned with the convergence of auditory and visual speech information. The authors' use of source-localised MEG classification analyses to address this issue is novel and the results presented build on a sufficient number of existing findings (e.g. the ability of auditory and visual responses to identify spoken and lip-read words, respectively), for me to find the two more surprising results: (1) limited overlap between audio and visual decoding, and (2) no significant cross-decoding of audio and visual speech to be intriguing and well worth publishing.

However, at the same time, I had some significant concerns that both of the key results that I have identified here depend on null findings in whole brain analyses of source-localised MEG responses. The authors must surely be aware that the absence of results reaching corrected significance cannot be taken to indicate that these effects are definitely absent. Absence of evidence is not evidence of absence, and this is particularly true when: (1) effects are tested using conventional null hypothesis significance testing, and (2) whole-brain correction for multiple comparisons is required, which substantially reduces statistical sensitivity.

Only by directly subtracting statistical maps of auditory and visual word classification can the authors conclude that there is a reliable difference between the brain regions that contribute to visual and auditory word identification. Furthermore, the authors' presentation of the regression betas for peak regions from auditory and visual classification (in Figure 3D) is misleading. Given how these peaks are defined (from whole-brain search for regions showing unimodal classification), it's inevitable that maxima from auditory classification will be less reliable when tested on visual classification (and vice-versa).

These problems become particularly acute for concluding – as I think the authors wish to – that auditory areas don't contribute to word classification in visual speech. This requires confirming a null hypothesis. This can only be achieved with a Bayesian analysis which quantifies the likelihood of the null hypothesis in different brain regions. Only by performing this analysis can the authors be confident that there is not reliable classification in auditory areas that would be detected had they performed a study with greater statistical power. The authors might wish to make use of independent data – such as from the audiovisual speech condition presented in the supporting information – to define ROIs and/or expected effect sizes based on independent data.

The same problem with interpretation of null effects arises in the cross-modality decoding analyses. It is striking that these analyses are reliable for audiovisual speech but not for auditory speech. However, while this shows that the method the authors are using can detect reliable effects of a certain size, the analyses presented cannot be used to infer whether the most likely interpretation is that cross-decoding of unimodal auditory and visual speech is absent, or that it is present but fails to reach a stringent level of significance. The authors wish to conclude that this effect is absent; they say: "The inability to cross-classify auditory and visual speech from local brain activity further supports the conclusion that acoustic and visual speech representations are largely distinct", but do not have a sound statistical basis on which to draw this conclusion.

I think some substantial additional analysis, and careful consideration of which conclusions are, or are not supported by direct statistical analyses are required if this work is to be published with the current framing. This is not to say that word identity decoding in auditory and visual speech is uninteresting, or that the seeming lack of correlation between these abilities is not of interest. Only that I found the fundamental claim in the paper – that there's no common phonological representation of auditory and visual speech – to be insufficiently supported by the data presented in the current manuscript.

Reviewer #3:

In this manuscript, Keitel et al. use MEG pattern classification to compare the cortical regions involved in the encoding and comprehension of auditory and visual speech. The article is well-written, addresses a scientifically valuable topic, and extends the current literature by employing multivariate approaches. In their stimulus classification analysis, the authors found that visually-presented words were best classified on the basis of signals localized to occipital regions, while auditory words were best classified in perisylvian regions. In their neuro-behavioral decoding analysis, the authors found that classification strength predicted behavioral accuracy in largely distinct regions during auditory versus visual stimulation. They conclude that perceptually-relevant speech representations are largely modality specific.

While I largely agreed with the authors' rationale in employing these techniques, I had some concerns regarding the statistical/mathematical details of their approach. First and most importantly, the neuro-behavioral decoding analysis presented in Figure 3A and 3C does not directly test the hypothesis that forms their primary conclusion (i.e., that "Largely distinct networks mediate perceptually-relevant auditory and visual speech representations"). To test this hypothesis directly, it would be necessary to compare neuro-behavioral decoding accuracy between the auditory and visual conditions for each vertex and then multiple-comparison correct across vertices. That is, rather than comparing decoding accuracy against the null separately for each condition, decoder accuracy should be compared directly across conditions. The authors perform a similar analysis in Figure 3D, but this analysis may inflate Type-I error because it involves pre-selecting peaks identified in each single-condition analysis.

Second, throughout the manuscript, a wide variety of statistical techniques are used which are sometimes internally inconsistent, weakly or not explicitly justified, or interpreted in a manner that is not fully consistent with their mathematical implications. For example, for different analyses, different multiple comparison correction procedures are used (FDR vs. cluster-based FWE control), without a clear justification. It would be best to use similar corrections across analyses to ensure similar power. Also, the authors report that they: "Tested a number of different classifiers, […] then selected a correlation-based nearest-neighbour classifier as this performed slightly better than the others". Absent an a priori justification for the chosen approach, it would be helpful to know whether the reported results are robust to these design decisions. Finally, it is not clear to me that the neuro-behavioral decoding technique employed here is particularly well-suited for identifying regions that represent participants' percepts. It seems the most natural way to perform such an analysis would be to train a classifier to predict participants' trial-wise responses. By contrast, the technique employed here compares (binary) trial-wise accuracy with "representational distance", computed as the difference between "the average correlation with trials of the same (correct) word identity and the mean of the correlation with the three alternatives." Thus, the classifier would not be expected to identify regions with response patterns that predict participants' percepts, but those which exhibit a target-like pattern when the participant responds accurately. It is therefore perhaps best conceived of as an "accuracy classifier" rather than a "percept classifier." This may be problematic for the authors' interpretation because activity in areas unrelated to perceptual representations (e.g., areas involved in unimodal attention) could also be predictive of accuracy.

Finally, in some cases, the methods description was not self-sufficient, leaving the reader to consult references to fully understand. One critical question is how the spatial and temporal dimensions of the data were used in the classifier. If classification is primarily driven by the temporal dimension, it is not clear that successful classification really relies on pattern similarity in population responses, rather than inter-trial temporal covariation produced by, e.g., phase-resetting or entrainment to stimulus dynamics. In the cited methods articles, spatial classification is performed separately for different time bins, alleviating this concern. It would be important to critically consider this detail in interpreting these results.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Largely distinct networks mediate perceptually-relevant auditory and visual speech representations" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Tobias Reichenbach as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Barbara Shinn-Cunningham as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Matthew H Davis (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

As the editors have judged that your manuscript is of interest, but as described below that additional experiments are required before it is published, we would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is 'in revision at eLife'. Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)

Summary:

The revised paper presents a better-fitting analysis, and does a more nuanced job in discussing the results, than the original manuscript. However, there are still a few major criticisms that we have for the analysis, detailed below.

Essential revisions:

1) Brain-wide, multiple-comparison corrected tests comparing auditory versus visual decoding are still lacking. The authors have now provided vertex-wise Bayes factors within areas that showed significant decoding in each individual condition. Unfortunately, this is not satisfactory, because these statistics are (1) potentially circular because ROIs were pre-selected based on an analysis of individual conditions, (2) not multiple-comparison corrected, and (3) rely on an arbitrary prior that is not calibrated to the expected effect size. Still, ignoring these issues, the only area that appears to contain vertices with "strong evidence" for a difference in neuro-behavioral decoding is the MOG, which wouldn't really support the claim of "largely distinct networks" supporting audio vs. visual speech representation.

The authors may address these issues, for instance, by

i) presenting additional whole-brain results – e.g. for a direct comparison of auditory and visual classification (in Figure 2) and of perceptual prediction (in Figure 3).

ii) presenting voxel-wise maps of Bayesian evidence values (as in Figure 2—figure supplement 3) for the statistical comparisons shown in Figure 2D, and Figure 3D

iii) in the text included in Figure 2D and 3D making clear what hypotheses correspond to the null hypothesis and to the alternative hypothesis (i.e. auditory = visual, auditory <> visual).

2) As noted before by reviewer 3, the classifiers used in this study do not discriminate between temporal versus spatial dimensions of decoding accuracy. This leaves it unclear whether the reported results are driven by (dis)similarity of spatial patterns of activity (as in fMRI-based MVPA), temporal patterns of activity (e.g., oscillatory "tracking" of the speech signal), or some combination. As these three possibilities could lead to very different interpretations of the data, it seems critical to distinguish between them. For example, the authors write "the encoding of the acoustic speech envelope is seen widespread in the brain, but correct word comprehension correlates only with focal activity in temporal and motor regions," but, as it stands, their results could be partly driven by this non-specific entrainment to the acoustic envelope.

In their response, the authors show that classifier accuracy breaks down when spatial or temporal information is degraded, but it would be more informative to show how these two factors interact. For example, the methods article cited by the authors (Grootswagers, Wardle and Carlson, 2017) shows classification accuracy for successive time bins after stimulus onset (i.e., they train different classifiers for each time bin 0-100 ms, 100-200 ms, etc.). The timing of decoding accuracy in different areas could also help to distinguish between different plausible explanations of the results.

Finally, it is somewhat unclear how spatial and temporal information are combined in the current classifier. Figure 1—figure supplement 2 creates the impression that the time-series for each vertex within a spotlight were simply concatenated. However, this would conflate within-vertex (temporal) and across-vertex (spatial) variance.

3) The concern that the classifier could conceivably index factors influencing "accuracy" rather than the perceived stimulus does not appear to be addressed sufficiently. Indeed, the classifier is referred to as identifying "sensory representations" throughout the manuscript, when it could just as well identify areas involved in any other functions (e.g., attention, motor function) that would contribute to accurate behavioral performance. This limitation should be acknowledged in the manuscript. The authors could consider using the timing of decoding accuracy in different areas to disambiguate these explanations.

The authors state in their response that classifying based on the participant's reported stimulus (rather than response accuracy) could "possibly capture representations not related to speech encoding but relevant for behaviour only (e.g. pre-motor activity). These could be e.g. brain activity that leads to perceptual errors based on intrinsic fluctuations in neural activity in sensory pathways, noise in the decision process favouring one alternative response among four choices, or even noise in the motor system that leads to a wrong button press without having any relation to sensory representations at all."

But it seems that all of these issues would also affect the accuracy-based classifier. Moreover, it seems that intrinsic fluctuations in sensory pathways, or possibly noise in the decision process, are part of what the authors are after. If noise in a sensory pathway can be used to predict particular inaccurate responses, isn't that strong evidence that it encodes behaviorally-relevant sensory representations? For example, intrinsic noise in V1 has been found to predict responses in a simple visual task in non-human primates, with false alarm trials exhibiting noise patterns that are similar to target responses (Seidemann and Geisler, 2018). Showing accurate trial-by-trial decoding of participants' incorrect responses could similarly provide stronger evidence that a certain area contributes to behavior.

https://doi.org/10.7554/eLife.56972.sa1

Author response

[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]

We would like to thank the reviewers and editors for their very helpful and constructive comments. Based on these we have substantially improved the manuscript, both by refining the statistical approach and the interpretation of the results.

A main concern shared by several reviewers was that some of our conclusions were based on statistical null results, and these were not sufficiently supported by quantitative analysis (e.g. Bayes factors). With this in mind, we have revised the entire analysis pipeline. First, we ensured that all analyses are based on the same statistical corrections for multiple comparisons (which was another concern). Second, we implemented direct condition contrasts as requested and now provide Bayes factors to substantiate the evidence for or against the respective null hypotheses. Third, we implemented additional control analyses to understand whether and how spatial and temporal response patterns contribute to the overall classification performance. And fourth, we analysed semantic and phonological stimulus features to obtain a better understanding of the sensory features driving comprehension performance.

All analyses were computed de novo for the revision. Due to changes in the statistical procedures some of the specific details in the results have changed (e.g. the number of clusters emerging in specific analyses). However, the main conclusions put forward in the previous version still hold and thanks to the additional analyses we can now provide a more refined interpretation of these in the Discussion.

Reviewer #1:

The authors investigate audiovisual speech processing using MEG. They present participants with speech in noise, as well as with videos of talking faces. In both conditions, subjects understand about 70% of the speech. The authors then investigate which brain areas decode word identity, as well as which brain areas predict the subject's actual comprehension. They find largely distinct areas for encoding of words as well as for predicting subject performance. Moreover, for both aspects of speech processing, the brain areas differ largely between the auditory and the visual stimulus presentation.

The paper is well written, and the obtained results shed new light on audio-visual speech processing. In particular, they show a large degree of modality-specific processing, as well as a dissociation between the brain areas for speech representation and for encoding comprehension. I am therefore in favour of publication. But I have two major comments that I would like the authors to address in a revised version.

1) I don't see the point of presenting the supplementary Figure 1. Since the behavioural results are at ceiling, only 8 subjects (those whose performance is away from the ceiling) are included in the MEG analysis. But this number of subjects is too small to draw conclusive results. The authors account for this to a degree by labelling the results as preliminary. But since the results cannot be considered conclusive, I think they should either be left out or modified to be conclusive.

We agree that the number of participants who performed below ceiling in the audiovisual condition is too small for conclusive results. Hence, we have removed the respective data from the manuscript but decided to keep the audiovisual results that included all participants in Figure 2—figure supplement 2. There we show the AV data regarding behavioural performance (N = 20) and classification performance (N = 18). We also kept the cross-classification analysis between A/V and the AV conditions to demonstrate that cross-classification is possible in principle (Figure 2—figure supplement 3).

2) The correlation between stimulus classification and the behavioural performance is only done for the visual condition (subsection “Strong sensory representations do not necessarily predict behaviour”). The authors state that this correlation can't be performed in the auditory condition because the comprehension scores there were 70%. But Figure 1B shows significant subject-to-subject variability in speech comprehension around 70%. While the variation is less than for lip reading, I don't see a reason why this performance cannot be related to the stimulus classification as well. Please add that.

We have added the respective correlation for the auditory condition, which is now presented in Figure 3—figure supplement 2A. We have also added a correlation between auditory/visual classification and individual auditory SNR values (please see response to reviewer 2 below). The statistical analysis for these correlations was adapted to be consistent with analyses in the rest of the manuscript. A cluster-based permutation test did not yield any significant results. We also report respective Bayes factors, which supported overall results (Figure 3—figure supplement 2B).

Reviewer #2:

This paper describes an MEG study in which neural responses to unimodal auditory and visual (lip-read) sentences are analysed to determine the spatial locations at which brain responses provide information to distinguish between words, whether and how these responses are associated with correct perceptual identification of words, and the degree to which common neural representations in overlapping brain regions contribute to perception of auditory and visual speech.

Results show largely distinct networks of auditory, visual brain areas that convey information on word identity in heard and lip-read speech and that contribute to perception (inferred by increased information in neural representations for correctly identified words). There was some, limited neural overlap seen in higher-order language areas (such as inferior frontal gyrus, and temporal pole). However, attempts at cross-decoding – i.e. testing for common neural representations between words that were identified in visual and auditory speech – were non-significant. This is despite significant cross-decoding of both auditory and visual speech using audio-visual presentation.

Results are interpreted as showing separate, modality-specific neural representations of auditory and visual word identities, and taken to explain the independence (i.e. non-correlation) between individual differences in measures of auditory and visual speech perception abilities.

This paper addresses an important and interesting topic. Individual differences in auditory and visual speech perception are well established, and the lack of correlation between these abilities in the population appears to be a well-replicated, but currently under-explained, finding. Indeed, this basic observation goes against the dominant thrust of research on audio-visual speech perception which is largely concerned with the convergence of auditory and visual speech information. The authors' use of source-localised MEG classification analyses to address this issue is novel and the results presented build on a sufficient number of existing findings (e.g. the ability of auditory and visual responses to identify spoken and lip-read words, respectively), for me to find the two more surprising results: (1) limited overlap between audio and visual decoding, and (2) no significant cross-decoding of audio and visual speech to be intriguing and well worth publishing.

However, at the same time, I had some significant concerns that both of the key results that I have identified here depend on null findings in whole brain analyses of source-localised MEG responses. The authors must surely be aware that the absence of results reaching corrected significance cannot be taken to indicate that these effects are definitely absent. Absence of evidence is not evidence of absence, and this is particularly true when: (1) effects are tested using conventional null hypothesis significance testing, and (2) whole-brain correction for multiple comparisons is required, which substantially reduces statistical sensitivity.

Only by directly subtracting statistical maps of auditory and visual word classification can the authors conclude that there is a reliable difference between the brain regions that contribute to visual and auditory word identification. Furthermore, the authors' presentation of the regression betas for peak regions from auditory and visual classification (in Figure 3D) are misleading. Given how these peaks are defined (from whole-brain search for regions showing unimodal classification), it's inevitable that maxima from auditory classification will be less reliable when tested on visual classification (and vice-versa).

The reviewer points out a critical shortcoming in our previous submission: the lack of evidence for statistical null results. We have addressed this point using additional data analyses, whereby we now provide direct between-condition contrasts for all grid-points in the significant clusters, both for the stimulus classification (Figure 2) and the neuro-behavioural analysis (Figure 3). For each of these we computed a direct contrast between the A and V conditions and derived the associated Bayes factors to substantiate the evidence for or against the respective null hypothesis. These new results are presented in Figures 2D and 3D and support our conclusion that stimulus information for acoustic and visual speech is provided both by potentially amodal regions (e.g. post-central regions) as well as by regions that contain (significant) information only about a single modality (e.g. occipital regions). Furthermore, they support the conclusion that auditory and visual comprehension are driven in large part by distinct networks, but are also facilitated by an overlap of auditory and visual representations in angular and frontal regions. We revised the Discussion and conclusions in the light of this refined analysis.

The reviewer also points out that our presentation of the effects obtained at local peak voxels was misleading. We did not intend to present these as an (indeed) circular statistical analysis, but rather as post-hoc visualisation and quantification of the underlying effects. In the revised manuscript, we now avoid this potentially misleading step, and derive significant clusters from the respective full-brain maps and simply report local peak results in Tables 1 and 2.

These problems become particularly acute for concluding – as I think the authors wish to – that auditory areas don't contribute to word classification in visual speech. This requires confirming a null hypothesis. This can only be achieved with a Bayesian analysis which quantifies the likelihood of the null hypothesis in different brain regions. Only by performing this analysis can the authors be confident that there is not reliable classification in auditory areas that would be detected had they performed a study with greater statistical power. The authors might wish to make use of independent data – such as from the audiovisual speech condition presented in the supporting information – to define ROIs and/or expected effect sizes based on independent data.

Following this comment, we now provide a systematic analysis of statistical contrasts and associated Bayes factors (Figures 2D and 3D). These new results allow us to directly support some of our original conclusions, but also provide a more nuanced picture. Starting from those regions where cerebral word representations are significantly predictive of comprehension (Figure 3A and B) we find that in some regions many grid points exhibit evidence for a differential contribution to auditory and visual comprehension, while in other regions (e.g. IFG, AG) many grid points exhibit evidence for no modality specificity (Figure 3D). We have revised the Discussion to fully reflect these results.

We would like to point out that our data indeed suggest the involvement of temporal regions in visual speech comprehension (right STG in Figure 3B), but that this region does not predict auditory speech comprehension. Our main conclusion is therefore that regions predictive of auditory and visual comprehension are largely distinct.

For example: “Thereby our results support a route of visual speech into auditory cortical and temporal regions but provide no evidence for an overlap of speech representations in the temporal lobe that would facilitate both lip-reading and acoustic speech comprehension.” A more elaborate summary of our conclusions can be found in the Discussion.

The same problem with interpretation of null effects arises in the cross-modality decoding analyses. It is striking that these analyses are reliable for audiovisual speech but not for auditory speech. However, while this shows that the method the authors are using can detect reliable effects of a certain size, the analyses presented cannot be used to infer whether the most likely interpretation is that cross-decoding of unimodal auditory and visual speech is absent, or that it is present, but fails to reach a stringent level of significance. The authors wish to conclude that this effect is absent, they say: "The inability to cross-classify auditory and visual speech from local brain activity further supports the conclusion that acoustic and visual speech representations are largely distinct", but do not have a sound statistical basis on which to draw this conclusion.

We addressed this comment in two ways. First, we revised the significance testing of the cross-decoding performance, for which we now report brain-wide cluster-based permutation statistics. This new statistical analysis (which is now consistent with the cluster statistics in the rest of the manuscript) did not yield any significant effects (Figure 2—figure supplement 3A). Second, we now also provide Bayes factors based on t-values, derived from a comparison with a random distribution. The topographical maps (Figure 2—figure supplement 3B) show that for the large majority of brain regions, cross-classification is not possible (Bayes factors supporting the H0), with the exception of some irregular grid points. An exploratory cluster analysis (only for the purpose of this response) found that, even with a minimum cluster size of 1 grid point, no significant clusters occurred. We are therefore confident that cross-classification is not possible in a meaningful way between the auditory and visual conditions for the present data.

I think some substantial additional analysis, and careful consideration of which conclusions are, or are not supported by direct statistical analyses are required if this work is to be published with the current framing. This is not to say that word identity decoding in auditory and visual speech is uninteresting, or that the seeming lack of correlation between these abilities is not of interest. Only that I found the fundamental claim in the paper – that there's no common phonological representation of auditory and visual speech – to be insufficiently supported by the data presented in the current manuscript.

The revised paper now clearly spells out which results are substantiated by appropriate statistical evidence for modality specific results, and which not. In doing so, we can now provide a more detailed and nuanced view on which regions contribute perceptually-relevant word encoding for acoustic and visual speech.

For example, in the first paragraph of the Discussion we state: “Our results show that the cerebral representations of auditory and visual speech are mediated by both modality specific and overlapping (potentially amodal) representations. While several parietal, temporal and frontal regions were engaged in the encoding of both acoustically and visually conveyed word identities (“stimulus classification”), comprehension in both sensory modalities was largely driven by distinct networks. Only the inferior frontal and angular gyrus contained regions that contributed similarly to both auditory and visual comprehension.”

Reviewer #3:

In this manuscript, Keitel et al. use MEG pattern classification to compare the cortical regions involved in the encoding and comprehension of auditory and visual speech. The article is well-written, addresses a scientifically valuable topic, and extends the current literature by employing multivariate approaches. In their stimulus classification analysis, the authors found that visually-presented words were best classified on the basis of signals localized to occipital regions, while auditory words were best classified in perisylvian regions. In their neuro-behavioral decoding analysis, the authors found that classification strength predicted behavioral accuracy in largely distinct regions during auditory versus visual stimulation. They conclude that perceptually-relevant speech representations are largely modality specific.

While I largely agreed with the authors' rationale in employing these techniques, I had some concerns regarding the statistical/mathematical details of their approach. First and most importantly, the neuro-behavioral decoding analysis presented in Figure 3A and 3C does not directly test the hypothesis that forms their primary conclusion (i.e., that "Largely distinct networks mediate perceptually-relevant auditory and visual speech representations"). To test this hypothesis directly, it would be necessary to compare neuro-behavioral decoding accuracy between the auditory and visual conditions for each vertex and then multiple-comparison correct across vertices. That is, rather than comparing decoding accuracy against the null separately for each condition, decoder accuracy should be compared directly across conditions. The authors perform a similar analysis in Figure 3D, but this analysis may inflate Type-I error because it involves pre-selecting peaks identified in each single-condition analysis.

As reported in the reply to reviewer 2, we now provide direct statistical contrasts between A and V conditions and report Bayes factors for the null hypotheses to support our conclusions. These analyses in large part support our previous claims, but also highlight a more nuanced picture, as reflected in the revised Discussion. Please see also the comments above. For example, in the first paragraph of the Discussion we now write: “Our results show that the cerebral representations of auditory and visual speech are mediated by both modality-specific and overlapping (potentially amodal) representations. While several parietal, temporal and frontal regions were engaged in the encoding of both acoustically and visually conveyed word identities (“stimulus classification”), comprehension in both sensory modalities was largely driven by distinct networks. Only the inferior frontal and angular gyrus contained regions that contributed similarly to both auditory and visual comprehension.”

Second, throughout the manuscript, a wide variety of statistical techniques are used which are sometimes internally inconsistent, weakly or not explicitly justified, or interpreted in a manner that is not fully consistent with their mathematical implications. For example, for different analyses, different multiple comparison correction procedures are used (FDR vs. cluster-based FWE control), without a clear justification. It would be best to use similar corrections across analyses to ensure similar power. Also, the authors report that they: "Tested a number of different classifiers, […] then selected a correlation-based nearest-neighbour classifier as this performed slightly better than the others". Absent an a priori justification for the chosen approach, it would be helpful to know whether the reported results are robust to these design decisions.

We have revised all statistical analyses and now use the same correction methods for all full-brain analyses. In a preliminary analysis we had tested different classifiers in their ability to classify the stimulus set (e.g. a nearest neighbour classifier based on the Euclidean distance, and correlation decoders based on different temporal sampling). Based on the overall performance across all three conditions (A, V, AV) we decided on the classifier and parameters used, although we note that the differences between versions of the classifier were small (2–3%). In part, we now directly address this point by reporting how the stimulus classification performance depends on the spatial searchlight and the temporal resolution. We have repeated the stimulus classification using a range of spatial and temporal parameters for the searchlight. In particular, we removed the spatial information and systematically reduced the temporal resolution.

These complementary results are now mentioned in the manuscript.

“We also probed classification performance based on a number of spatio-temporal searchlights, including temporal binning of the data at 3.3, 20, 40 and 60 ms, and including each neighbouring spatial grid point into the searchlight or averaging across grid points. Comparing classification performance revealed that in particular the auditory condition was sensitive to the choice of searchlight. Classification performance dropped when ignoring the spatial configuration or sampling the data at a resolution lower than 20 ms (median performance of the 10% grid points with highest performance based on each searchlight: 27.0%, 26.9%, 26.5% and 25.8% for 3.3, 20, 40 and 60-ms bins and the full spatial searchlight; and 26.4% at 20 ms and ignoring the spatial pattern). We hence opted for a classifier based on the source data represented as spatial searchlight (1.2-cm radius) and sampled at 20-ms resolution.”

Author response image 1 illustrates these results.

Author response image 1

Finally, it is not clear to me that the neuro-behavioral decoding technique employed here is particularly well-suited for identifying regions that represent participants' percepts. It seems the most natural way to perform such an analysis would be to train a classifier to predict participants' trial-wise responses. By contrast, the technique employed here compares (binary) trial-wise accuracy with "representational distance", computed as the difference between "the average correlation with trials of the same (correct) word identity and the mean of the correlation with the three alternatives." Thus, the classifier would not be expected to identify regions with response patterns that predict participants' percepts, but those which exhibit a target-like pattern when the participant responds accurately. It is therefore perhaps best conceived of as an "accuracy classifier" rather than a "percept classifier." This may be problematic for the authors' interpretation because activity in areas unrelated to perceptual representations (e.g., areas involved in unimodal attention) could also be predictive of accuracy.

The reviewer touches on an important issue concerning the interpretation of the mapped representations. Our study was motivated by the notion of intersection information, that is the search for cerebral representations of a stimulus that are used for the respective single trial behaviour (Panzeri et al., 2017; Pica et al., 2017). This intersection information can be formalised theoretically and can be measured using information theoretic approaches. However, these principled approaches are still computationally inefficient. Following the neuroimaging field, we hence opted for a distance-to-bound method, where the amount of sensory evidence captured in a classifier is regressed against behaviour using a linear model (see, e.g. Grootswagers, Cichy and Carlson, 2018). This method is computationally cheaper and allows for full-brain permutation statistics. In contrast to this approach, a classifier trained on participants' responses (choices) would, at least on error trials, possibly capture representations not related to speech encoding but relevant for behaviour only (e.g. pre-motor activity). These could be, for example, brain activity that leads to perceptual errors based on intrinsic fluctuations in neural activity in sensory pathways, noise in the decision process favouring one alternative response among four choices, or even noise in the motor system that leads to a wrong button press without having any relation to sensory representations at all. The latter example highlights the need to base any analysis of behaviourally-relevant sensory representations (i.e. the intersection information) on classifiers which are firstly trained to discriminate the relevant sensory information and which are then probed as to how behaviourally-relevant the classifier output is. We have revised the Discussion to better explain the rationale of our approach in this respect.
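As an illustration only, the classifier evidence discussed here (the average correlation with trials of the correct word minus the mean correlation with the alternatives) could be sketched as follows. This is a minimal sketch and not the analysis code used in the study: the function names, the flat list representation of a searchlight pattern, and the choice to average the alternatives word by word are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def word_evidence(test_pattern, train_patterns, train_labels):
    """Mean correlation of one test trial with the training trials of each word."""
    by_word = {}
    for pattern, label in zip(train_patterns, train_labels):
        by_word.setdefault(label, []).append(pearson(test_pattern, pattern))
    return {word: sum(r) / len(r) for word, r in by_word.items()}

def classify(evidence):
    """Correlation-based nearest-neighbour decision: the word with the highest mean correlation."""
    return max(evidence, key=evidence.get)

def representational_distance(evidence, target_word):
    """Evidence for the correct word minus the mean evidence for the alternative words."""
    others = [r for word, r in evidence.items() if word != target_word]
    return evidence[target_word] - sum(others) / len(others)
```

In a distance-to-bound analysis, the trial-wise value returned by `representational_distance` would then be regressed against the participant's response accuracy on the same trials; that regression step is not shown here.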

Finally, in some cases, the methods description was not self-sufficient, leaving the reader to consult references to fully understand. One critical question is how the spatial and temporal dimensions of the data were used in the classifier. If classification is primarily driven by the temporal dimension, it is not clear that successful classification really relies on pattern similarity in population responses, rather than inter-trial temporal covariation produced by, e.g., phase-resetting or entrainment to stimulus dynamics. In the cited methods articles, spatial classification is performed separately for different time bins, alleviating this concern. It would be important to critically consider this detail in interpreting these results.

We have revised the Materials and methods in many instances to ensure that all procedures are described clearly. We have added more information about the spatio-temporal dimensions of the classifier (see also above), noting that both spatial and temporal patterns of local brain activity were contributing to the stimulus classification. Whether the neural “mechanisms” mentioned, such as phase-resetting, indeed contribute to the cerebral representations studied here is a question that is surely beyond the scope of this study.

[Editors’ note: what follows is the authors’ response to the second round of review.]

Essential revisions:

1) Brain-wide, multiple-comparison corrected tests comparing auditory versus visual decoding are still lacking. The authors have now provided vertex-wise Bayes factors within areas that showed significant decoding in each individual condition. Unfortunately, this is not satisfactory, because these statistics are (1) potentially circular because ROIs were pre-selected based on an analysis of individual conditions, (2) not multiple-comparison corrected, and (3) rely on an arbitrary prior that is not calibrated to the expected effect size. Still, ignoring these issues, the only area that appears to contain vertices with "strong evidence" for a difference in neuro-behavioral decoding is the MOG, which wouldn't really support the claim of "largely distinct networks" supporting audio vs. visual speech representation.

The authors may address these issues, for instance, by

i) presenting additional whole-brain results – e.g. for a direct comparison of auditory and visual classification (in Figure 2) and of perceptual prediction (in Figure 3).

ii) presenting voxel-wise maps of Bayesian evidence values (as in Figure 2—figure supplement 3) for the statistical comparisons shown in Figure 2D, and Figure 3D

iii) in the text included in Figure 2D and 3D making clear what hypotheses correspond to the null hypothesis and to the alternative hypothesis (i.e. auditory = visual, auditory <> visual).

We addressed this comment using additional data analysis to ensure that all claims are supported by sufficient statistical evidence.

i) In the revised manuscript, we now provide the suggested full-brain cluster-corrected contrasts between auditory and visual conditions for both main analyses (Figure 2—figure supplement 1A, Figure 3—figure supplement 1A). However, we caution against the interpretation of condition-wise differences at grid points that do not exhibit significant evidence for the primary “function” of interest. We therefore refrain from interpreting condition-wise differences in word classification performance (or the prediction of comprehension) at grid points that do not exhibit significant word classification (prediction of comprehension) in at least one condition (vs. the randomization null). Hence, we masked the full-brain cluster-corrected condition differences against all grid points contributing to word classification (or the prediction of comprehension) in at least one modality (while also reporting all clusters in the figure caption).

Importantly, and in line with the general concerns raised about interpreting null results, these full-brain condition differences can provide evidence in favour of a condition difference, and therefore support the existence of modality specific regions. However, they cannot provide evidence in favour of a null finding of no modality specialisation. The analysis of Bayes factors provided in the previous revision, which we retained in Figures 2D,3D, in contrast, can provide such evidence in favour of a null result. Hence in the revised manuscript we kept the Bayes factors in the main figure, while now also providing the full-brain cluster-based statistics in the supplemental material. Importantly, we base all interpretations of the results on the combined evidence provided by the full-brain condition differences and these Bayes factors.

ii) We now also provide full-brain maps with Bayes factors for contrasts between the two modalities (in Figure 2—figure supplement 1B, Figure 3—figure supplement 1B, alongside the full-brain cluster maps).

iii) We have added the specific hypotheses to Figures 2 and 3 as suggested.

The results of these additional statistical tests do not affect our main findings, but they support the conclusions derived from the ROI-based analysis. We carefully ensured that the revised manuscript clearly acknowledges that the individual analyses are inconclusive for some parts of the brain or offer specific evidence for no difference between modalities in other parts. For example:

“A separate full-brain cluster-based permutation test (Figure 3—figure supplement 1A) provided evidence for a significant modality specialisation for auditory words in four clusters in the left middle occipital gyrus, left calcarine gyrus, right posterior angular gyrus, and bilateral supplementary motor area. The corresponding full-brain Bayes factors (Figure 3—figure supplement 1B) support this picture but also provide no evidence for a modality preference, or inconclusive results, in many other regions.”

To acknowledge that there is a large number of grid points that do not show a modality difference, we have also revised the title of this manuscript to “Shared and modality-specific brain regions that mediate auditory and visual word comprehension”

The comment also suggests that the analyses of ROI-specific comparisons may be circular. First, we now provide the full brain results for the Bayes factors in Figure 2—figure supplement 1B and Figure 3—figure supplement 1B. Concerning the ROI-based results in Figures 2,3, it is important to note that we compare condition-wise differences (as Bayes factors) within regions pre-selected to show an effect in at least one of the two modalities, hence independent of the contrast. We do so, as the interpretation of a condition-wise difference (e.g. in word classification) in a brain region not exhibiting significant classification in any condition is difficult, if not impossible. Such pre-selection of electrodes or regions of interest is very common (Cheung et al., 2016, Giordano et al., 2017, Karas et al., 2019, Mihai et al., 2019, Ozker, Yoshor and Beauchamp, 2018) and supported by the use of orthogonal contrasts for selection and comparison.

The comment further notes that we relied on an “arbitrary prior” to calculate Bayes factors. We would like to emphasize that the JZS Bayes factor, while not being data-driven by the present study, is not arbitrary. It has been advocated heavily by a number of studies for situations where no specific information about the expected effect sizes is available (Rouder et al., 2009, 2012) and is highly accepted in the current literature (Guitard and Cowan, 2020, Mazor, Friston and Fleming, 2020, Puvvada and Simon, 2017, Kimel, Ahissar and Lieder, 2020). The default scale of √2/2 corresponds to the assumption that the prior of effect sizes follows a Cauchy distribution with 50% of probability mass placed on effect sizes smaller than √2/2, and 50% larger than this number (Schönbrodt and Wagenmakers, 2018). Looking at our data, the effect sizes in the neurobehavioral analysis (Table 2) seem to follow such a pattern rather well (4 out of 9 effect sizes, calculated as Cohen’s D, are smaller than √2/2, while 5 are larger than this). Hence, taking this one specific statistical contrast from our study as evidence, it makes sense to assume that similar effect sizes are expected also in the other tests, such as the condition-wise differences for which we report the Bayes factors in Figures 2D, 3D. Of course, in an ideal case one would base the choice of the prior on pre-existing and independent data; unfortunately, such data were not available.
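To make the role of the prior scale concrete, the following sketch (an illustration, not the code used for the reported analyses) checks that a zero-centred Cauchy prior places half of its mass on absolute effect sizes below its scale, and evaluates a one-sample JZS Bayes factor from a t-value by numerical integration, following the integral form given by Rouder et al. (2009). The function names, the integration grid and the example values of t and n are assumptions chosen for the example.

```python
import math

DEFAULT_SCALE = math.sqrt(2) / 2  # approx. 0.707

def cauchy_mass_within(scale, bound):
    """P(|delta| < bound) under a zero-centred Cauchy prior with the given scale."""
    return 2.0 * math.atan(bound / scale) / math.pi

def jzs_bf10(t, n, scale=DEFAULT_SCALE, steps=20000):
    """One-sample JZS Bayes factor BF10 for t-value t and sample size n.

    The marginal likelihood under H1 is integrated over g with the
    trapezoid rule on u = log(g), which keeps the integrand well behaved.
    """
    nu = n - 1
    null_term = (1 + t * t / nu) ** (-(nu + 1) / 2)
    lo, hi = -20.0, 40.0
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        g = math.exp(lo + i * h)
        a = 1 + n * g * scale * scale
        value = (a ** -0.5
                 * (1 + t * t / (a * nu)) ** (-(nu + 1) / 2)
                 * (2 * math.pi) ** -0.5
                 * g ** -0.5              # g^(-3/2) dg becomes g^(-1/2) du
                 * math.exp(-1 / (2 * g)))
        total += value * (0.5 if i in (0, steps) else 1.0)
    return total * h / null_term

half_mass = cauchy_mass_within(DEFAULT_SCALE, DEFAULT_SCALE)  # equals 0.5 by construction
bf_null = jzs_bf10(0.0, 20)    # t = 0: evidence favours H0 (BF10 below 1/3)
bf_effect = jzs_bf10(5.0, 20)  # large t: evidence favours H1 (BF10 well above 3)
```

With the conventional thresholds, BF10 < 1/3 is read as evidence for the null hypothesis and BF10 > 3 as evidence for the alternative; values in between are inconclusive.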

2) As noted before by reviewer 3, the classifiers used in this study do not discriminate between temporal versus spatial dimensions of decoding accuracy. This leaves it unclear whether the reported results are driven by (dis)similarity of spatial patterns of activity (as in fMRI-based MVPA), temporal patterns of activity (e.g., oscillatory "tracking" of the speech signal), or some combination. As these three possibilities could lead to very different interpretations of the data, it seems critical to distinguish between them. For example, the authors write "the encoding of the acoustic speech envelope is seen widespread in the brain, but correct word comprehension correlates only with focal activity in temporal and motor regions," but, as it stands, their results could be partly driven by this non-specific entrainment to the acoustic envelope.

In their response, the authors show that classifier accuracy breaks down when spatial or temporal information is degraded, but it would be more informative to show how these two factors interact. For example, the methods article cited by the authors (Grootswagers, Wardle and Carlson, 2017) shows classification accuracy for successive time bins after stimulus onset (i.e., they train different classifiers for each time bin 0-100 ms, 100-200 ms, etc.). The timing of decoding accuracy in different areas could also help to distinguish between different plausible explanations of the results.

This comment raises a number of interesting questions, which we (partly) addressed in the previous round. Unfortunately, our treatment of this question in the previous round was perhaps not fully comprehensive, and the results were reported only in a single sentence in the Materials and methods. To address these points in full, we have done a series of additional analyses and report these results more extensively in the manuscript and in this reply.

We start by noting that the use of a spatio-temporal searchlight in MEG source analysis is common in the literature (Cao et al., 2019, Giordano, et al., 2017, Kocagoncu et al., 2017, Su et al., 2012), and is analogous to the inclusion of all M/EEG sensors in sensor-based analyses relying on classification methods or RSA analysis (Cichy and Pantazis, 2017, Guggenmos, Sterzer and Cichy, 2018, Kaiser, Azzalini and Peelen, 2016). However, and maybe unlike the use of spatial searchlights in many fMRI studies, the spatial component (on the scale of 1.2 cm as used here) in MEG source space is expected to add only minor information, given the natural smoothness of MEG source data. Hence, the emphasis for the present analysis was on the temporal domain.

Concerning the duration of the chosen time window in the main analysis (500 ms), we note that this has been adapted to the specifics of the experimental design and task (Figure 1). In particular, this window was chosen to cover the different target words as much as possible without including the following word in the sentence (which is different in every sentence and would therefore have contaminated the decoding analysis). Spoken words are temporally extended, and to study their cerebral encoding it makes sense to use a time window that covers most of the stimulus duration. Otherwise one runs the risk of capturing processes related to lexical exploration/competition or word predictions (Klimovich-Gray et al., 2019, Kocagoncu, et al., 2017, Marslen-Wilson and Welsh, 1978). We therefore chose a longer time window. This is possibly in contrast to studies on visual object encoding, where stimuli are often flashed for a few tens of milliseconds only, and the encoding window often extends this by a certain, but ambiguous amount. In contrast, the choice of a 500-ms window here is parsimonious given the nature of the stimuli and task.

Importantly, by task design, the sentence continued beyond the target word, which was the 2nd or 3rd last word in the sentence (Figure 1). Hence, the sentence stimuli continued beyond the analysis window (for 1.48 s to 2.81 s; 1.98 ± 0.31 s [M ± SD]), and any motor response, or the relevant response options presented to the participants, followed much later. With this design, we ensured that motor preparation is very unlikely to emerge within the analysed time window.

To address this comment using data analysis, we first compared the classification performance with and without the spatial dimension in the searchlight. As noted above, the natural expectation is that the spatial dimension adds only little additional information in comparison to the time domain. In the additional analysis, we quantified whether and by how much the spatial dimension adds to the ability to classify word identities (Author response image 2). The average classification performance (within the 10% grid points with the highest performance in the classification including searchlight) differed little (auditory: 27.19 ± 0.48% [M ± SD] with searchlight vs 26.72 ± 1.89% without searchlight; visual: 28.71 ± 1.55% with searchlight vs 27.71 ± 2.44% without searchlight, [averages across grid points and participants]; this equals an overall percent change in performance of -1.49% and -2.45% in the auditory and visual condition, respectively). A correlation between the full-brain maps with and without spatial dimension showed that 99.8% of grid points in the auditory, and 99.7% of grid points in the visual condition, were significantly correlated (Pearson correlation, at p <.05, FDR-corrected). Finally, a direct group-level comparison revealed that the majority of grid points (66.9% in the auditory and 63.9% in the visual condition) showed evidence for no difference (i.e. BF10 < 1/3) between including or excluding the spatial dimension (see Author response image 2, bottom panel). This shows that the spatial component contributes only modestly to the classification performance. This result is now reported in the manuscript as follows:

“We also probed the influence of the spatial searchlight by i) including each neighbouring spatial grid point into the searchlight, or ii) averaging across grid points, and iii) by not including a searchlight altogether. Ignoring the spatial pattern by averaging grid points led to a small drop in classification performance (individual vs average grid points: auditory, 27.19 ± 0.48 vs 26.72 ± 0.67; visual, 28.71 ± 1.55 vs 27.71 ± 1.25 [M ± SD]). Performance also dropped slightly when no searchlight was included (auditory, 26.77 ± 1.93; visual, 27.86 ± 2.50 [M ± SD]).”
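The FDR correction mentioned above for the map correlations can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg step-up procedure; the reply does not state which FDR method was used, and the function name and the p-values below are hypothetical placeholders, not data from the study.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where a test survives FDR control at level q.

    Assumption: Benjamini-Hochberg step-up procedure (not confirmed by the source).
    """
    m = len(p_values)
    # Indices of the tests, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k for which p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # All tests up to that rank are declared significant.
    significant = [False] * m
    for idx in order[:max_rank]:
        significant[idx] = True
    return significant


# Hypothetical p-values, one per grid point (illustration only).
example_p = [0.001, 0.008, 0.039, 0.041, 0.20]
mask = benjamini_hochberg(example_p)
fraction_significant = sum(mask) / len(mask)
print(mask, fraction_significant)
```

The fraction of grid points surviving the correction is then simply the mean of the resulting mask, which is the kind of summary quoted for the auditory and visual maps.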

Author response image 2
Word classification performance with and without a spatial searchlight.

Top panel: original results including a 1.2-cm searchlight (as in Figure 2 in the manuscript). Middle panel: classification results without searchlight. Bottom panel: Bayes factors of a group-level t-test comparing classification performance with and without searchlight. The majority of grid points (66.9% in the auditory and 63.9% in the visual condition) showed evidence for no difference (i.e. BF10 < 1/3) between tests, while only a small fraction of grid points showed evidence for a strong or substantial difference between tests. There is no systematic improvement when including the searchlight.

To address the question of how the temporal discretisation of MEG activity within this time window affects classification performance, we compared classifiers operating on data binned at 20, 40, and 60 ms bins. These results had already been reported in the previous reply letter and were included in the Materials and methods section. To give more emphasis to these results we have moved them to a separate section (Selection of parameters and classifier procedures).

In brief, we find that a temporal resolution of 20 ms is sufficient to recover most of the stimulus information contained in the data. Using larger windows would lead to a loss of information, while shorter windows did not seem to add any classification performance.
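The temporal discretisation compared above can be sketched as follows. This is only an illustration of averaging a single time series into non-overlapping bins; the 200-Hz sampling rate, the function name, and the placeholder data are assumptions and are not taken from the manuscript.

```python
def bin_time_series(samples, sampling_rate_hz, bin_ms):
    """Average consecutive samples into non-overlapping bins of bin_ms milliseconds."""
    samples_per_bin = round(sampling_rate_hz * bin_ms / 1000)
    if samples_per_bin < 1:
        raise ValueError("bin is shorter than one sample")
    # An incomplete trailing bin is dropped.
    n_bins = len(samples) // samples_per_bin
    return [
        sum(samples[i * samples_per_bin:(i + 1) * samples_per_bin]) / samples_per_bin
        for i in range(n_bins)
    ]


# Hypothetical 500-ms epoch at an assumed 200 Hz (100 samples of placeholder data).
epoch = [float(i) for i in range(100)]
for bin_ms in (20, 40, 60):
    binned = bin_time_series(epoch, 200, bin_ms)
    print(bin_ms, "ms bins ->", len(binned), "features per grid point")
```

Wider bins yield fewer features per grid point, which is one way to see why coarser binning can discard stimulus information.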

To address the concern that some form of non-specific temporal entrainment of brain activity may confound our results, we implemented a further analysis using shorter time windows. We divided the original 500-ms window into shorter time epochs, as suggested by the reviewer. The length of these (140 ms) was chosen to avoid contributions of rhythmic activity at the critical time scales of 2 – 4 Hz, which have prominently been implicated in speech-to-brain entrainment (e.g. Ding and Simon, 2014, Luo, Liu and Poeppel, 2010, Molinaro and Lizarazu, 2017). We repeated the word classification (and the prediction of comprehension) in 7 partly overlapping (by 60 ms) epochs of 140 ms duration. We then subjected the epoch-specific results to the same full-brain cluster-based permutation statistics as used for the full 500-ms window. Finding significant word classification (or prediction of comprehension) in these shorter epochs would speak against the notion that some sort of entrainment critically contributes to (or confounds) our results.

Before presenting this result, we note that introducing 7 additional time epochs adds to the problem of correcting for multiple comparisons. The results of such a fine-grained analysis can either be considered at a much more stringent criterion (when correcting across all 7 time epochs) or a less stringent criterion (when not correcting, and hence accepting a sevenfold higher false positive rate) compared to the main analysis in the manuscript. We here chose to correct for multiple tests by using Bonferroni correction and adjusting the significance threshold to α = 0.05/7 = 0.0071.

Author response image 3 presents a cumulative whole-brain histogram of the significant grid points across all 7 epochs and the epoch-specific number of grid points that are significant in each epoch.

Author response image 3
Classification performance and neurobehavioural prediction over time.

Top panels represent cumulative whole-brain histograms of the significant grid points across all 7 epochs and bottom panels represent the epoch-specific number of grid points that are significant in each epoch. Please note that these results are very conservative due to the Bonferroni-corrected threshold of α = 0.0071. The maps resemble those of the full-window analysis presented in the manuscript.

These results provide several important insights. First, they confirm that significant word classification (and prediction of comprehension) can be obtained in shorter epochs, arguing against temporally entrained brain activity presenting some critical confounding factor. Second, the results show that the significant grid points obtained (cumulatively) across epochs cover largely the same regions found using the full time window. Of the grid points reported in Figures 2 and 3, the percentage that becomes significant in at least one time epoch is as follows: for word classification, auditory 44.7% and visual 60.3%; for the neurobehavioural analysis, auditory 19.2% and visual 24.4%. Note that without the very strict Bonferroni correction, a much larger fraction of the original grid points is also found in the short time windows (classification: 77.3%/82.3%; neurobehavioural: 57.9%/77.2%; auditory/visual).

Third, the fraction of grid points that are significant in at least one time epoch but not in the analysis of the full time window, and hence emerge only in the analysis of shorter time epochs, is small: 9.2%/5.2% for auditory/visual word classification, 4.6%/2.7% for auditory/visual neurobehavioural prediction.

These results suggest that the use of the full 500-ms time window can be justified: first, this time window covers a large proportion of target words relevant for the behavioural task and is therefore directly motivated given the experimental design. Second, it is sufficiently separated from the motor response (c.f. Figure 1 and Materials and methods). Third, it is not confounded by temporally entrained activity at 2 – 4 Hz (Author response image 3). And finally, the use of shorter time epochs reveals largely the same brain regions (Author response image 3). However, in contrast to the full time window, the length of any shorter window will always remain arbitrary, hence necessitating one more arbitrary choice in the analysis pipeline.

Finally, it is somewhat unclear how spatial and temporal information are combined in the current classifier. Figure 1—figure supplement 2 creates the impression that the time-series for each vertex within a spotlight were simply concatenated. However, this would conflate within-vertex (temporal) and across-vertex (spatial) variance.

As is common in classification or RSA analyses based on spatio-temporal activity patterns, we indeed concatenated the time series obtained at the different grid points within each spatial neighbourhood. The same procedure is often used in voxel-based fMRI analyses and in MEG/EEG source-level analyses (Cao et al., 2019, Giordano et al., 2017, Kocagoncu et al., 2017, Su et al., 2012). While this indeed conflates spatial and temporal information, we note that the 1.2-cm radius of the spatial searchlight still retains a high level of spatial information in the overall source map. Most importantly, and as shown by the above analyses, only a small amount of extra information is contained in the spatial pattern, and hence this mixing of spatial and temporal dimensions did not strongly influence our results.
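The concatenation described above can be made concrete with a small sketch. It only illustrates how a spatio-temporal feature vector differs from a purely temporal one for a single trial; the data layout, function names, and values are hypothetical and do not reproduce the actual analysis code.

```python
def searchlight_features(source_data, neighbours):
    """Concatenate the time series of all grid points inside one searchlight.

    source_data: dict mapping grid-point index -> list of time-binned amplitudes
    neighbours:  grid-point indices inside the searchlight (centre included)
    """
    features = []
    for grid_point in neighbours:
        # Time series are appended one after another, so spatial and
        # temporal dimensions end up in a single feature vector.
        features.extend(source_data[grid_point])
    return features


def temporal_only_features(source_data, centre):
    """Feature vector without a searchlight: the centre grid point's time series only."""
    return list(source_data[centre])


# Hypothetical single trial: 3 grid points, 3 time bins each (placeholder values).
trial = {0: [0.1, 0.2, 0.3], 1: [0.0, 0.1, 0.0], 2: [0.4, 0.5, 0.6]}
spatio_temporal = searchlight_features(trial, [0, 1, 2])
temporal_only = temporal_only_features(trial, 0)
print(len(spatio_temporal), len(temporal_only))
```

A classifier trained on the concatenated vector cannot tell which features differ across space and which across time, which is the conflation the reviewer points out.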

3) The concern that the classifier could conceivably index factors influencing "accuracy" rather than the perceived stimulus does not appear to be addressed sufficiently. Indeed, the classifier is referred to as identifying "sensory representations" throughout the manuscript, when it could just as well identify areas involved in any other functions (e.g., attention, motor function) that would contribute to accurate behavioral performance. This limitation should be acknowledged in the manuscript. The authors could consider using the timing of decoding accuracy in different areas to disambiguate these explanations.

We agree that other factors enhancing the cerebral processes involved in encoding the sensory input and translating this into a percept (e.g. attention) could affect behaviour. At the same time, we designed the paradigm to specifically avoid confounding factors, such as motor preparation, by placing the target words not at the end of the sentence (c.f. Figure 1; Materials and methods). In the revised manuscript we discuss this as follows:

“While we found strongest word classification performance in sensory areas, significant classification also extended into central, frontal and parietal regions. This suggests that the stimulus-domain classifier used here may also capture processes potentially related to attention or motor preparation. While we cannot rule out that levels of attention differed between conditions, we ensured by experimental design that comprehension performance did not differ between modalities. In addition, the relevant target words were placed not at the end of the sentence to prevent motor planning and preparation during their presentation (see Stimuli).”

The idea to investigate the time courses within specific brain areas is interesting, but opens a very large number of additional degrees of freedom. The above presented analysis of small time-windows did this to some extent. We would like to refrain from a complete analysis of decoding over time in different brain areas, as this is a different research question than the one the study was designed to answer.

The authors state in their response that classifying based on the participant's reported stimulus (rather than response accuracy) could "possibly capture representations not related to speech encoding but relevant for behaviour only (e.g. pre-motor activity). These could be e.g. brain activity that leads to perceptual errors based on intrinsic fluctuations in neural activity in sensory pathways, noise in the decision process favouring one alternative response among four choices, or even noise in the motor system that leads to a wrong button press without having any relation to sensory representations at all."

But it seems that all of these issues would affect the accuracy-based classifier as well. Moreover, it seems that intrinsic fluctuations in sensory pathways, or possibly noise in the decision process, are part of what the authors are after. If noise in a sensory pathway can be used to predict particular inaccurate responses, isn't that strong evidence that it encodes behaviorally-relevant sensory representations? For example, intrinsic noise in V1 has been found to predict responses in a simple visual task in non-human primates, with false alarm trials exhibiting noise patterns that are similar to target responses (Seidemann and Geisler 2018). Showing accurate trial-by-trial decoding of participants' incorrect responses could similarly provide stronger evidence that a certain area contributes to behavior.

Unfortunately, we are not sure what “accuracy-based classifier” refers to here. The classifier used in the present study classifies the word “identity”, and operates in the stimulus domain, not in the domain of participants' response or accuracy. The evidence contained in this stimulus-domain classifier is then used, in a regression model, as a predictor of participants' response accuracy. In the response to the previous comments, we contrasted this approach with a putative analysis based on a classifier directly trained on participants' choice; such an analysis had been mentioned in the previous set of reviewer comments (“It seems the most natural way to perform such an analysis would be to train a classifier to predict participants' trial-wise responses” to quote from the previous set of comments). We have tried to make our approach clearer early in the manuscript:

“Using multivariate classification, we quantified how well the single-trial identity of the target words (18 target words, each repeated 10 times) could be correctly predicted from source-localised brain activity (“stimulus classifier”).”

As we argued previously, we believe that our analysis has several strengths in contrast to this suggestion of a choice-based classifier. In particular, the argument that “these issues” (quoted from this reviewer’s point above) affect our analysis as well, does not seem clear to us. For example, a classifier trained on choice would contain information about activity patterns that drive behaviour (erroneously) on trials where participants missed the stimulus. In this case, their sensory cortices would not encode these and overt behaviour would be driven by noise somewhere in the decision or motor circuits. In contrast, an analysis capitalising first on the encoding of the relevant sensory information (by a stimulus-domain classifier), and then using a signature of how well this is encoded in cerebral activity, avoids being confounded by decision or motor noise. The reasoning behind our approach is very much in line with recent suggestions made by various groups for how to best elucidate the sensory representations that drive perception and comprehension, based on an analysis that in a first stage chiefly capitalises on the encoding of sensory information (Grootswagers, Cichy, and Carlson, 2018, Panzeri et al., 2017) and then relates this to behavioural performance. We refined the manuscript to ensure this is very clearly spelled out in different places, for example:

“This approach directly follows the idea to capture processes related to the encoding of external (stimulus-driven) information and to then ask whether these representations correlate over trials with the behavioural outcome or report.”
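The second stage of this approach, relating single-trial classifier evidence to response accuracy, can be sketched as below. This is a minimal illustration that assumes a logistic regression fitted by plain gradient ascent; the reply only states that a regression model was used, and the function name, learning settings, and trial values are hypothetical.

```python
import math


def fit_logistic(evidence, correct, learning_rate=0.1, n_iter=5000):
    """Fit P(correct) = sigmoid(b0 + b1 * evidence) by gradient ascent on the log-likelihood.

    Assumption: logistic link and gradient ascent are chosen for illustration only.
    """
    b0, b1 = 0.0, 0.0
    n = len(evidence)
    for _ in range(n_iter):
        grad0 = 0.0
        grad1 = 0.0
        for x, y in zip(evidence, correct):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            grad0 += y - p
            grad1 += (y - p) * x
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1


# Hypothetical single-trial values: evidence of the stimulus-domain classifier
# for the presented word, and whether the participant responded correctly.
evidence = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 1.1, 1.5]
correct = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
intercept, slope = fit_logistic(evidence, correct)
print("slope relating classifier evidence to accuracy:", slope)
```

A positive slope at a grid point would indicate that trials with stronger stimulus evidence tend to be answered correctly. Note that the classifier itself never sees the participants' responses in this scheme.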

Of course, no technical approach is perfect, and the statement about noise in a sensory pathway touches upon an interesting issue (“If noise in a sensory pathway can be used to predict particular inaccurate responses, isn't that strong evidence that it encodes behaviourally-relevant sensory representations?” quoted from the present set of comments). Indeed, for our approach it does not matter where precisely variations in noise in the encoding process emerge, as long as these affect either the cerebral reflection of sensory information or how this relates on a trial-by-trial basis to behaviour. However, by design of our paradigm (4 alternative response choices) and the type of classifier used (c.f. Materials and methods), the present analysis capitalises on the correct encoding of the sensory information, while we cannot specifically link the incorrect encoding of a stimulus with behaviour. This arises because evidence against the correct stimulus is not directly evidence in favour of one specific other stimulus; this is in contrast to studies using a two-alternative forced choice design, where evidence against one stimulus / response option directly translates into evidence in favour of the other. We have revised the text to directly reflect this limitation of the present approach:

“In addition, by design of our experiment (4 response options) and data analysis, the neurobehavioral analysis was primarily driven by trials in which the respective brain activity encoded the sensory stimulus correctly. We cannot specifically link the incorrect encoding of a stimulus with behaviour. This is in contrast to studies using only two stimulus or response options, where evidence for one option directly provides evidence against the other (Frühholz et al., 2016, Petro et al., 2013).”

References

Cao, Y., Summerfield, C., Park, H., Giordano, B. L., and Kayser, C. (2019). Causal inference in the multisensory brain. Neuron, 102(5), 1076-1087.e1078.

Cheung, C., Hamilton, L. S., Johnson, K., and Chang, E. F. (2016). The auditory representation of speech sounds in human motor cortex. eLife, 5, e12577.

Cichy, R. M., and Pantazis, D. (2017). Multivariate pattern analysis of MEG and EEG: A comparison of representational structure in time and space. NeuroImage, 158, 441-454.

Ding, N., and Simon, J. Z. (2014). Cortical entrainment to continuous speech: functional roles and interpretations. Front Hum Neurosci, 8, 311. doi: 10.3389/fnhum.2014.00311

Frühholz, S., Van Der Zwaag, W., Saenz, M., Belin, P., Schobert, A.-K., Vuilleumier, P., and Grandjean, D. (2016). Neural decoding of discriminative auditory object features depends on their socio-affective valence. Social cognitive and affective neuroscience, 11(10), 1638-1649.

Giordano, B. L., Ince, R. A. A., Gross, J., Schyns, P. G., Panzeri, S., and Kayser, C. (2017). Contributions of local speech encoding and functional connectivity to audio-visual speech perception. eLife, 6. doi: 10.7554/eLife.24763

Grootswagers, T., Cichy, R. M., and Carlson, T. A. (2018). Finding decodable information that can be read out in behaviour. Neuroimage, 179, 252-262.

Guggenmos, M., Sterzer, P., and Cichy, R. M. (2018). Multivariate pattern analysis for MEG: a comparison of dissimilarity measures. Neuroimage, 173, 434-447.

Guitard, D., and Cowan, N. (2020). Do we use visual codes when information is not presented visually? Memory and Cognition.

Haxby, J. V., Connolly, A. C., and Guntupalli, J. S. (2014). Decoding neural representational spaces using multivariate pattern analysis. Annual review of neuroscience, 37, 435-456.

Kaiser, D., Azzalini, D. C., and Peelen, M. V. (2016). Shape-independent object category responses revealed by MEG and fMRI decoding. Journal of neurophysiology, 115(4), 2246-2250.

Kamitani, Y., and Tong, F. (2005). Decoding the visual and subjective contents of the human brain. Nature neuroscience, 8(5), 679-685.

Karas, P. J., Magnotti, J. F., Metzger, B. A., Zhu, L. L., Smith, K. B., Yoshor, D., and Beauchamp, M. S. (2019). The visual speech head start improves perception and reduces superior temporal cortex responses to auditory speech. eLife, 8.

Kimel, E., Ahissar, M., and Lieder, I. (2020). Capacity of short-term memory in dyslexia is reduced due to less efficient utilization of items' long-term frequency. bioRxiv.

Klimovich-Gray, A., Tyler, L. K., Randall, B., Kocagoncu, E., Devereux, B., and Marslen-Wilson, W. D. (2019). Balancing prediction and sensory input in speech comprehension: The spatiotemporal dynamics of word recognition in context. Journal of Neuroscience, 39(3), 519-527.

Kocagoncu, E., Clarke, A., Devereux, B. J., and Tyler, L. K. (2017). Decoding the cortical dynamics of sound-meaning mapping. Journal of Neuroscience, 37(5), 1312-1319.

Luo, H., Liu, Z., and Poeppel, D. (2010). Auditory cortex tracks both auditory and visual stimulus dynamics using low-frequency neuronal phase modulation. PLoS Biol, 8(8), e1000445. doi: 10.1371/journal.pbio.1000445

Marslen-Wilson, W. D., and Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cognitive psychology, 10(1), 29-63.

Mazor, M., Friston, K. J., and Fleming, S. M. (2020). Distinct neural contributions to metacognition for detecting, but not discriminating visual stimuli. eLife, 9, e53900.

Mihai, P. G., Moerel, M., de Martino, F., Trampel, R., Kiebel, S., and von Kriegstein, K. (2019). Modulation of tonotopic ventral medial geniculate body is behaviorally relevant for speech recognition. eLife, 8.

Molinaro, N., and Lizarazu, M. (2017). Delta (but not theta)-band cortical entrainment involves speech-specific processing. European Journal of Neuroscience, 48(7), 2642-2650. doi: 10.1111/ejn.13811

Ozker, M., Yoshor, D., and Beauchamp, M. S. (2018). Frontal cortex selects representations of the talker’s mouth to aid in speech perception. eLife, 7, e30387.

Panzeri, S., Harvey, C. D., Piasini, E., Latham, P. E., and Fellin, T. (2017). Cracking the neural code for sensory perception by combining statistics, intervention, and behavior. Neuron, 93(3), 491-507.

Petro, L. S., Smith, F. W., Schyns, P. G., and Muckli, L. (2013). Decoding face categories in diagnostic subregions of primary visual cortex. European Journal of Neuroscience, 37(7), 1130-1139.

Puvvada, K. C., and Simon, J. Z. (2017). Cortical representations of speech in a multitalker auditory scene. Journal of Neuroscience, 37(38), 9189-9196.

Ritchie, J. B., Kaplan, D. M., and Klein, C. (2019). Decoding the brain: Neural representation and the limits of multivariate pattern analysis in cognitive neuroscience. The British Journal for the Philosophy of Science, 70(2), 581-607.

Ritchie, J. B., Tovar, D. A., and Carlson, T. A. (2015). Emerging object representations in the visual system predict reaction times for categorization. PLoS computational biology, 11(6), e1004316.

Rouder, J. N., Morey, R. D., Speckman, P. L., and Province, J. M. (2012). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56(5), 356-374.

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., and Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic bulletin and review, 16(2), 225-237.

Schönbrodt, F. D., and Wagenmakers, E.-J. (2018). Bayes factor design analysis: Planning for compelling evidence. Psychonomic bulletin and review, 25(1), 128-142.

Su, L., Fonteneau, E., Marslen-Wilson, W., and Kriegeskorte, N. (2012). Spatiotemporal searchlight representational similarity analysis in EMEG source space. Paper presented at the 2012 Second International Workshop on Pattern Recognition in NeuroImaging.

https://doi.org/10.7554/eLife.56972.sa2

Article and author information

Author details

  1. Anne Keitel

    1. Psychology, University of Dundee, Dundee, United Kingdom
    2. Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
    Contribution
    Conceptualization, Resources, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    a.keitel@dundee.ac.uk
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-4498-0146
  2. Joachim Gross

    1. Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
    2. Institute for Biomagnetism and Biosignalanalysis, University of Münster, Münster, Germany
    Contribution
    Supervision, Funding acquisition, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Christoph Kayser

    Department for Cognitive Neuroscience, Faculty of Biology, Bielefeld University, Bielefeld, Germany
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-7362-5704

Funding

Biotechnology and Biological Sciences Research Council (BB/L027534/1)

  • Joachim Gross
  • Christoph Kayser

H2020 European Research Council (ERC-2014-CoG (grant No 646657))

  • Christoph Kayser

Wellcome (Joint Senior Investigator Grant (No 098433))

  • Joachim Gross

Deutsche Forschungsgemeinschaft (GR 2024/5-1)

  • Joachim Gross

Interdisziplinäres Zentrum für Klinische Forschung, Universitätsklinikum Würzburg (Gro3/001/19)

  • Joachim Gross

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This research was supported by the UK Biotechnology and Biological Sciences Research Council (BBSRC, BB/L027534/1). CK is supported by the European Research Council (ERC-2014-CoG; grant No 646657); JG by the Wellcome Trust (Joint Senior Investigator Grant, No 098433), DFG (GR 2024/5-1) and IZKF (Gro3/001/19). The authors declare no competing financial interests. We are grateful to Lea-Maria Schmitt for guiding the semantic distance analysis.

Ethics

Human subjects: All participants provided written informed consent prior to testing and received monetary compensation of £10/h. The experiment was approved by the ethics committee of the College of Science and Engineering, University of Glasgow (approval number 300140078), and conducted in compliance with the Declaration of Helsinki.

Senior Editor

  1. Barbara G Shinn-Cunningham, Carnegie Mellon University, United States

Reviewing Editor

  1. Tobias Reichenbach, Imperial College London, United Kingdom

Reviewers

  1. Tobias Reichenbach, Imperial College London, United Kingdom
  2. Matthew H Davis, University of Cambridge, United Kingdom

Publication history

  1. Received: March 16, 2020
  2. Accepted: August 18, 2020
  3. Accepted Manuscript published: August 24, 2020 (version 1)
  4. Version of Record published: September 3, 2020 (version 2)

Copyright

© 2020, Keitel et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
