Progressive neural engagement within the IFG-pMTG circuit as gesture and speech entropy and MI advances

  1. CAS Key Laboratory of Behavioral Science, Institute of Psychology, Chinese Academy of Sciences, Beijing, China
  2. Department of Psychology, University of Chinese Academy of Sciences, Beijing, China
  3. School of Psychology, Central China Normal University, Wuhan, China
  4. CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China
  5. Chinese Institute for Brain Research, Beijing, China 102206

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Andrea Martin
    Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
  • Senior Editor
    Yanchao Bi
    Beijing Normal University, Beijing, China

Reviewer #1 (Public review):

Summary:

The authors quantified information in gesture and speech, and investigated the neural processing of speech and gestures in pMTG and LIFG, depending on their informational content, in 8 different time-windows, and using three different methods (EEG, HD-tDCS and TMS). They found that there is a time-sensitive and staged progression of neural engagement that is correlated with the informational content of the signal (speech/gesture).

Strengths:

A strength of the paper is that the authors attempted to combine three different methods to investigate speech-gesture processing.

Weaknesses:

(1) One major issue is that there is a tight anatomical coupling between pMTG and LIFG. Stimulating one area could therefore also result in stimulation of the other area (see Silvanto and Pascual-Leone, 2008). I therefore think it is very difficult to tease apart the contribution of these areas to the speech-gesture integration process, especially considering that the authors stimulate these regions in time windows that are very close to each other in both time and space (and the disruption might last longer over time).

(2) Related to this point, it is unclear to me why the HD-tDCS/TMS is delivered in set time windows for each region. How did the authors determine this, and how do the results for TMS compare to their previous work from 2018 and 2023 (which describes a similar dataset+design)? How can they ensure they are only targeting their intended region since they are so anatomically close to each other?

(3) As the EEG signal is often not normally distributed, I was wondering whether the authors checked the assumptions for their Pearson correlations. The authors could perhaps better choose to model the different variables to see whether MI/entropy could predict the neural responses. How did they correct the many correlational analyses that they have performed?

(4) The authors use ROIs for their different analyses, but it is unclear why and on the basis of what these regions are defined. Why not consider all channels without making them part of an ROI, by using a method like the one described in my previous comment?

(5) The authors describe that they have divided their EEG data into a "lower half" and a "higher half" (lines 234-236), based on entropy scores. It is unclear why this is necessary, and I would suggest just using the entropy scores as a continuous measure.

Reviewer #2 (Public review):

Summary:

The study is an innovative and fundamental study that clarified important aspects of brain processes for integration of information from speech and iconic gesture (i.e., gesture that depicts action, movement, and shape), based on tDCS, TMS, and EEG experiments. They evaluated their speech and gesture stimuli in information-theoretic ways and calculated how informative speech is (i.e., entropy), how informative gesture is, and how much shared information speech and gesture encode. The tDCS and TMS studies found that the left IFG and pMTG, the two areas that were activated in fMRI studies on speech-gesture integration in the previous literature, are causally implicated in speech-gesture integration. The sizes of the tDCS and TMS effects correlate with the entropy of the stimuli or the mutual information, which indicates that the effects stem from the modulation of information decoding/integration processes. The EEG study showed that various ERP (event-related potential, e.g., N1-P2, N400, LPC) effects that have been observed in speech-gesture integration experiments in the previous literature are modulated by the entropy of speech/gesture and mutual information. This makes it clear that these effects are related to information decoding processes. The authors propose a model of how the speech-gesture integration process unfolds in time, and how IFG and pMTG interact with each other in that process.

Strengths:

The key strength of this study is that the authors used information-theoretic measures of their stimuli (i.e., entropy and mutual information between speech and gesture) in all of their analyses. This made it clear that the neuro-modulation (tDCS, TMS) affected information decoding/integration and that the ERP effects reflect information decoding/integration. This study used tDCS and TMS methods to demonstrate that left IFG and pMTG are causally involved in speech-gesture integration. The sizes of the tDCS and TMS effects correlate with information-theoretic measures of the stimuli, which indicates that the effects indeed stem from disruption/facilitation of the information decoding/integration process (rather than generic excitation/inhibition). The authors' results also showed correlations between information-theoretic measures of the stimuli and various ERP effects, indicating that these ERP effects reflect the information decoding/integration process.

Weaknesses:

The "mutual information" cannot fully capture the interplay of the meaning of speech and gesture. The mutual information is calculated based on what information can be decoded from speech alone and what information can be decoded from gesture alone. However, when speech and gesture are combined, a novel meaning can emerge, which cannot be decoded from a single modality alone. When example, a person produces a gesture of writing something with a pen, while saying "He paid". The speech-gesture combination can be interpreted as "paying by signing a cheque". It is highly unlikely that this meaning is decoded when people hear speech only or see gestures only. The current study cannot address how such speech-gesture integration occurs in the brain, and what ERP effects may reflect such a process. Future studies can classify different types of speech-gesture integration and investigate neural processes that underlie each type. Another important topic for future studies is to investigate how the neural processes of speech-gesture integration change when the relative timing between the speech stimulus and the gesture stimulus changes.

Reviewer #3 (Public review):

In this useful study, Zhao et al. try to extend the evidence for their previously described two-step model of speech-gesture integration in the posterior Middle Temporal Gyrus (pMTG) and Inferior Frontal Gyrus (IFG). They repeat some of their previous experimental paradigms, but this time quantify Information-Theoretic (IT) metrics of the stimuli in a Stroop-like paradigm purported to engage speech-gesture integration. They then correlate these metrics with the disruption, following brain stimulation, of what they claim to be an integration effect observable in reaction times during the tasks, and document the ERP components elicited in response to the variability in these metrics.

The integration of multiple methods, like tDCS, TMS, and ERPs, to provide converging evidence renders the results solid. However, their interpretation of the results should be taken with care, as some critical confounds, like difficulty, were not accounted for, and the conceptual link between the IT metrics and what the authors claim they index is tenuous and in need of more evidence. In some cases, the difficulty in making this link seems to arise from conceptual equivocation (e.g., their claims regarding 'graded' evidence), whilst in others it might arise from unclear wording in the manuscript (e.g., the sentence 'quantitatively functional mental states defined by a specific parser unified by statistical regularities'). Having said that, the authors' aim is valuable, and addressing these issues would render the work a very useful approach to improving our understanding of integration during semantic processing, of interest to scientists working in cognitive neuroscience and neuroimaging.

The main hurdle to achieving the aims set by the authors is the presence of the confound of difficulty in their IT metrics. Their measure of entropy, for example, being derived from the distribution of responses of the participants to the stimuli, will tend to be high for words or gestures with multiple competing candidate representations (this is what would presumably give rise to the diversity of responses in high-entropy items). There is ample evidence implicating IFG and pMTG as key regions of the semantic control network, which is critical during difficult semantic processing when, for example, semantic processing must resolve competition between multiple candidate representations, or when there are increased selection pressures (Jackson et al., 2021). Thus, the authors' interpretation of Mutual Information (MI) as an index of integration is inextricably contaminated with difficulty arising from multiple candidate representations. This casts doubt on the claims of the role of pMTG and IFG as regions carrying out gesture-speech integration, as the observed pattern of results could also be interpreted in terms of brain stimulation interrupting the semantic control network's ability to select the best candidate for a given context or respond to more demanding semantic processing.
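To make the confound concrete, a minimal sketch of how entropy derived from a response distribution behaves; the response counts below are invented for illustration:

```python
# Hypothetical sketch: Shannon entropy of a stimulus item estimated from the
# distribution of participants' naming responses. Counts are illustrative,
# not taken from the study's data.
import numpy as np

def response_entropy(counts):
    """Shannon entropy (bits) of a response-count distribution."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                          # ignore unobserved responses
    return float(-np.sum(p * np.log2(p)))

# A low-entropy item: most participants converge on one label.
print(response_entropy([28, 1, 1]))       # ~0.42 bits

# A high-entropy item: responses spread over many candidate labels, which is
# also what makes the item harder (the difficulty confound described above).
print(response_entropy([8, 7, 6, 5, 4]))  # ~2.28 bits
```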

In terms of conceptual equivocation, the use of the term 'graded' by the authors seems to differ from the usage commonly employed in the semantic cognition literature (e.g., the 'graded hub hypothesis', Rice et al., 2015). The idea of a graded hub in the controlled semantic cognition framework (i.e., the anterior temporal lobe) refers to a progressive degree of abstraction or heteromodal information as you progress through the anatomy of the region (i.e., along the dorsal-to-ventral axis). The authors, on the other hand, seem to use 'graded manner' in the context of a correlation of entropy or MI with the change in the difference between Reaction Times (RTs) of semantically congruent vs incongruent gesture-speech. The issue is that the discourse through parts of the introduction and discussion seems to conflate both interpretations, and the ideas in the main text do not correspond to the references they cite. Overall, this is not very convincing. What exactly are the authors arguing about the correlation between RTs and MI indices? As stated above, their measure of entropy captures the spread of responses, which could also be a measure of item difficulty (more diverse responses imply fewer correct responses, a classic index of difficulty). Capturing the diversity of responses means that items with high entropy scores are also likely to have multiple candidate representations, leading to increased selection pressures. Regions like pMTG and IFG have been widely implicated in difficult semantic processing and increased selection pressures (Jackson et al., 2021). How is this MI correlation evidence of integration that proceeds in a 'graded manner'? The conceptual links between these concepts must be made clearer for the interpretation to be convincing.

Author response:

Responses to Editors:

We appreciate Reviewer 1’s first concern regarding the difficulty of disentangling the contributions of tightly-coupled brain regions to the speech-gesture integration process—particularly due to the close temporal and spatial proximity of the stimulation windows and the potential for prolonged disruption. We would like to provide clarification and evidence supporting the validity of our methodology.

Our previous study (Zhao et al., 2021, J. Neurosci) employed the same experimental protocol—using inhibitory double-pulse transcranial magnetic stimulation (TMS) over the inferior frontal gyrus (IFG) and posterior middle temporal gyrus (pMTG) in one of eight 40-ms time windows. The findings from that study demonstrated a time-window-selective disruption of the semantic congruency effect (i.e., reaction time costs driven by semantic conflict), with no significant modulation of the gender congruency effect (i.e., reaction time costs due to gender conflict). This result establishes that double-pulse TMS provides sufficient temporal precision to independently target the left IFG and pMTG within these 40-ms windows during gesture-speech integration. Importantly, by comparing the distinctively inhibited time windows for IFG and pMTG, we offered clear evidence of distinct engagement and temporal dynamics between these regions during different stages of gesture-speech semantic processing.

Furthermore, we reviewed prior studies utilizing double-pulse TMS on structurally and functionally connected brain regions to explore neural contributions across timescales as brief as 3–60 ms. These studies, which encompass areas from the tongue and lip areas of the primary motor cortex (M1) to high-level semantic regions such as the pMTG and ATL (Author response table 1), consistently demonstrate the methodological rigor and precision of double-pulse TMS in disentangling the neural dynamics of different regions within these short temporal windows.

Author response table 1.

Double-pulse TMS studies on brain regions over 3-60 ms time intervals

Response to Reviewer #1:

(1) For concern on the difficulty of disentangling the contributions of tightly-coupled brain regions to the speech-gesture integration process:

We trust that the explanation provided above has clarified this issue.

(2) For concern on the rationale for delivering HD-tDCS/TMS in set time windows for each region, as well as how these time windows were determined and how the current results compare to our previous studies from 2018 and 2023:

The current study builds on a series of investigations that systematically examined the temporal and spatial dynamics of gesture-speech integration. In our earlier work (Zhao et al., 2018, J. Neurosci), we demonstrated that interrupting neural activity in the IFG or pMTG using TMS selectively disrupted the semantic congruency effect (reaction time costs due to semantic incongruence), without affecting the gender congruency effect (reaction time costs due to gender incongruence). These findings identified the IFG and pMTG as critical hubs for gesture-speech integration. This informed the brain regions selected for subsequent studies.

In Zhao et al. (2021, J. Neurosci), we employed a double-pulse TMS protocol, delivering stimulation within one of eight 40-ms time windows, to further examine the temporal involvement of the IFG and pMTG. The results revealed time-window-selective disruptions of the semantic congruency effect, confirming the dynamic and temporally staged roles of these regions during gesture-speech integration.

In Zhao et al. (2023, Frontiers in Psychology), we investigated the semantic predictive role of gestures relative to speech by comparing two experimental conditions: (1) gestures preceding speech by a fixed interval of 200 ms, and (2) gestures preceding speech at its semantic identification point. We observed time-window-selective disruptions of the semantic congruency effect in the IFG and pMTG only in the second condition, leading to the conclusion that gestures exert a semantic priming effect on co-occurring speech. These findings underscored the semantic advantage of gesture in facilitating speech integration, further refining our understanding of the temporal and functional interplay between these modalities.

The design of the current study—including the choice of brain regions and time windows—was directly informed by these prior findings. Experiment 1 (HD-tDCS) targeted the entire gesture-speech integration process in the IFG and pMTG to assess whether neural activity in these regions, previously identified as integration hubs, is modulated by changes in informativeness from both modalities (i.e., entropy) and their interactions (mutual information, MI). The results revealed a gradual inhibition of neural activity in both areas as MI increased, evidenced by a negative correlation between MI and the tDCS inhibition effect in both regions. Building on this, Experiments 2 and 3 employed double-pulse TMS and event-related potentials (ERPs) to further assess whether the engaged neural activity was both time-sensitive and staged. These experiments also evaluated the contributions of various sources of information, revealing correlations between information-theoretic metrics and time-locked brain activity, providing insights into the ‘gradual’ nature of gesture-speech integration.

We acknowledge that the rationale for the design of the current study was not fully articulated in the original manuscript. In the revised version, we will provide a more comprehensive and coherent explanation of the logic behind the three experiments, ensuring clear alignment with our previous findings.

(3) For concern about the use of Pearson correlation and the normality of EEG data:

We appreciate the reviewer’s thoughtful consideration. In Figure 5 of the manuscript, we have already included normal distribution curves that illustrate the relationships between the average ERP amplitudes within each ROI or elicited clusters and the three information models. Additionally, multiple comparisons were addressed using FDR correction, as outlined in the manuscript.

To further clarify the data, we will apply the Shapiro-Wilk test, a widely accepted method for assessing normality, to both the MI/entropy and averaged ERP data. The corresponding p-values will be provided in the follow-up point-by-point responses.
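As an illustration of the planned check, here is a minimal sketch assuming SciPy and statsmodels; the arrays are simulated placeholders, not the study's measurements:

```python
# Hedged sketch of a normality check plus correlation pipeline with FDR
# correction; all data below are simulated stand-ins.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import fdrcorrection

rng = np.random.default_rng(0)
n_items, n_rois = 40, 6
mi = rng.gamma(2.0, 1.0, size=n_items)                         # stand-in MI per item
erp = 0.3 * mi[:, None] + rng.normal(size=(n_items, n_rois))   # stand-in ROI amplitudes

pvals = []
for roi in range(n_rois):
    _, p_norm = stats.shapiro(erp[:, roi])        # univariate normality check
    if p_norm < 0.05:
        r, p = stats.spearmanr(mi, erp[:, roi])   # rank-based fallback
    else:
        r, p = stats.pearsonr(mi, erp[:, roi])
    pvals.append(p)

# Benjamini-Hochberg FDR across the family of ROI correlations.
rejected, p_adj = fdrcorrection(pvals, alpha=0.05)
print(rejected, np.round(p_adj, 3))
```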

(4) For concern about the ROI selection, and the suggestion of using whole-brain electrodes to build models of different variables (MI/entropy) to predict neural responses:

For the EEG data, we conducted both a traditional region-of-interest (ROI) analysis, with ROIs defined based on well-established prior work (Habets et al., 2011), and a cluster-based permutation approach, which utilizes data-driven permutations to enhance robustness and address multiple comparisons. The latter method complements the hypothesis-driven ROI analysis by offering an exploratory, unbiased perspective. Notably, the results from both approaches were consistent, reinforcing the reliability of our findings.
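For readers less familiar with these methods, a minimal sketch of the two complementary approaches, assuming MNE-Python; the simulated data, injected effect, and ROI channel list are placeholders rather than the study's definitions:

```python
# Illustrative contrast of a hypothesis-driven ROI analysis and a data-driven
# cluster-based permutation test; data are simulated.
import numpy as np
from mne.stats import permutation_cluster_test

rng = np.random.default_rng(1)
# (subjects, time points, channels) for higher- vs lower-information halves
high = rng.normal(size=(20, 200, 32))
low = rng.normal(size=(20, 200, 32))
high[:, 80:120, :8] += 0.8                  # inject an effect for illustration

# 1) ROI analysis: average amplitude over a predefined channel set.
roi = [0, 1, 2, 3]                          # placeholder ROI channel indices
roi_diff = high[:, :, roi].mean(-1) - low[:, :, roi].mean(-1)

# 2) Cluster-based permutation over all channels and time points; permuting
#    condition labels controls the multiple-comparisons problem. (A real
#    analysis would pass a channel adjacency via the `adjacency` argument.)
t_obs, clusters, cluster_pv, _ = permutation_cluster_test(
    [high, low], n_permutations=1000, tail=0, seed=1)
print(sum(p < 0.05 for p in cluster_pv), "significant cluster(s)")
```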

To make the methods more accessible to a broader audience, we will provide a clear description of the methods used and how they relate to each other in the revised manuscript.

Reference:

Habets, B., Kita, S., Shao, Z. S., Ozyurek, A., & Hagoort, P. (2011). The role of synchrony and ambiguity in speech-gesture integration during comprehension. Journal of Cognitive Neuroscience, 23, 1845-1854. https://doi.org/10.1162/jocn.2010.21462

(5) For concern about the median split of the data:

To identify ERP components or spatiotemporal clusters that demonstrated significant semantic differences, we split each model into higher and lower halves, thereby indexing the information changes reflected by entropy or mutual information (MI). To illustrate the gradual activation process, the identified components and clusters were then analyzed for correlations with each information metric. Remarkably, consistent results were observed across the ERP components and clusters, providing robust evidence that the semantic information conveyed through gestures and speech significantly influenced the amplitudes of these components and clusters, with amplitudes varying in tandem with that information.
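A minimal sketch of this two-stage logic, using simulated stand-ins for the entropy scores and component amplitudes:

```python
# Sketch of the two-stage approach described above: a median split to
# localize components, then correlation with the continuous scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
entropy = rng.gamma(2.0, 1.0, size=40)         # stand-in per-item entropy
amp = -0.4 * entropy + rng.normal(size=40)     # stand-in component amplitude

# Stage 1: median split into lower/higher halves to localize components.
high_half = entropy > np.median(entropy)
t, p_split = stats.ttest_ind(amp[high_half], amp[~high_half])

# Stage 2: correlate amplitudes with the continuous scores (as the reviewer
# suggests) to show graded, rather than merely binary, modulation.
r, p_corr = stats.pearsonr(entropy, amp)
print(f"split: t={t:.2f}, p={p_split:.3f}; continuous: r={r:.2f}, p={p_corr:.3f}")
```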

We acknowledge that the rationale behind this approach may not have been sufficiently clear in the initial manuscript. In our revision, we will ensure a more detailed and precise explanation to enhance the clarity and coherence of this logical framework.

Response to Reviewer #2:

We greatly appreciate Reviewer 2's concern regarding whether "mutual information" adequately captures the interplay between the meanings of speech and gesture. We would like to clarify that the materials used in the present study involved gestures performed without actual objects, paired with verbs that precisely describe the corresponding actions. For example, a hammering gesture was paired with the verb “hammer”, and a cutting gesture was paired with the verb “cut”. In this design, all gestures conveyed redundant meaning relative to the co-occurring speech, creating significant overlap between the information derived from speech alone and that from gesture alone.

We understand the reviewer’s concern about cases where gestures and speech may provide complementary rather than redundant information. To address this, we have developed an alternative metric for quantifying information gains contributed by supplementary multisensory cues, which will be explored in a subsequent study. However, for the present study, we believe that the observed overlap in information serves as an indicator of the degree of multisensory convergence, a central focus of our investigation.

Regarding the reviewer’s concern about how the neural processes of speech-gesture integration may change with variations in the relative timing between speech and gesture stimuli, we would like to highlight findings from our previous study (Zhao et al., 2023, Frontiers in Psychology). In that study, we explored the semantic predictive role of gestures relative to speech under two conditions: (1) gestures preceding speech by a fixed interval of 200 ms, and (2) gestures preceding speech at its semantic identification point. Interestingly, only in the second condition did we observe time-window-selective disruptions of the semantic congruency effect in the IFG and pMTG. This led us to conclude that gestures play a semantic priming role for co-occurring speech. Building on this, we designed the present study with gestures preceding speech at its semantic identification point to reflect this semantic priming relationship. Additionally, ongoing research is exploring gesture and speech interactions in natural conversational settings to investigate whether the neural processes identified here are consistent across varying contexts.

To prevent any similar concerns from causing doubt among the audience and to ensure clarity regarding the follow-up study, we will provide a detailed discussion of the two issues in the revised manuscript.

Response to Reviewer #3:

The primary aim of this study is to investigate whether the degree of activity in the established integration hubs, IFG and pMTG, is influenced by the information provided by gesture-speech modalities and/or their interactions. While we provided evidence for the differential involvement of the IFG and pMTG by delineating their dynamic engagement across distinct time windows of gesture-speech integration and associating these patterns with unisensory information and their interaction, we acknowledge that the mechanisms underlying these dynamics remain open to interpretation. Specifically, whether the observed effects stem from difficulties in semantic control processes, as suggested by Reviewer 3, or from resolving information uncertainty, as quantified by entropy, falls outside the scope of the current study. Importantly, we view these two interpretations as complementary rather than mutually exclusive, as both may be contributing factors. Nonetheless, we agree that addressing this question is a compelling avenue for future research. In the revised manuscript, we will include an exploratory analysis to investigate whether the confounding difficulty, stemming from the number of lexical or semantic representations, is limited to high-entropy items. Additionally, we will address and discuss alternative interpretations.

Regarding the concern of conceptual equivocation, we would like to emphasize that this study represents the first attempt to focus on the relationship between information quantity and neural engagement. In our initial presentation, we inadvertently conflated the commonly used term "graded hub," which refers to anatomical distribution, with its usage in the present context. We sincerely apologize for this oversight and are grateful for the reviewer’s careful critique. In the revised manuscript, we will clearly articulate the study’s objectives, clarify the representations of entropy and mutual information, and accurately describe their association with neural engagement.

References

Teige, C., Mollo, G., Millman, R., Savill, N., Smallwood, J., Cornelissen, P. L., & Jefferies, E. (2018). Dynamic semantic cognition: Characterising coherent and controlled conceptual retrieval through time using magnetoencephalography and chronometric transcranial magnetic stimulation. Cortex, 103, 329-349.

Amemiya, T., Beck, B., Walsh, V., Gomi, H., & Haggard, P. (2017). Visual area V5/hMT+ contributes to perception of tactile motion direction: a TMS study. Scientific Reports, 7(1), 40937.

Muessgens, D., Thirugnanasambandam, N., Shitara, H., Popa, T., & Hallett, M. (2016). Dissociable roles of preSMA in motor sequence chunking and hand switching—a TMS study. Journal of Neurophysiology, 116(6), 2637-2646.

Vernet, M., Brem, A. K., Farzan, F., & Pascual-Leone, A. (2015). Synchronous and opposite roles of the parietal and prefrontal cortices in bistable perception: a double-coil TMS–EEG study. Cortex, 64, 78-88.

Pitcher, D. (2014). Facial expression recognition takes longer in the posterior superior temporal sulcus than in the occipital face area. Journal of Neuroscience, 34(27), 9173-9177.

Bardi, L., Kanai, R., Mapelli, D., & Walsh, V. (2012). TMS of the FEF interferes with spatial conflict. Journal of Cognitive Neuroscience, 24(6), 1305-1313.

D’Ausilio, A., Bufalari, I., Salmas, P., & Fadiga, L. (2012). The role of the motor system in discriminating normal and degraded speech sounds. Cortex, 48(7), 882-887.

Pitcher, D., Duchaine, B., Walsh, V., & Kanwisher, N. (2010). TMS evidence for feedforward and feedback mechanisms of face and body perception. Journal of Vision, 10(7), 671-671.

Gagnon, G., Blanchet, S., Grondin, S., & Schneider, C. (2010). Paired-pulse transcranial magnetic stimulation over the dorsolateral prefrontal cortex interferes with episodic encoding and retrieval for both verbal and non-verbal materials. Brain Research, 1344, 148-158.

Kalla, R., Muggleton, N. G., Juan, C. H., Cowey, A., & Walsh, V. (2008). The timing of the involvement of the frontal eye fields and posterior parietal cortex in visual search. Neuroreport, 19(10), 1067-1071.

Pitcher, D., Garrido, L., Walsh, V., & Duchaine, B. C. (2008). Transcranial magnetic stimulation disrupts the perception and embodiment of facial expressions. Journal of Neuroscience, 28(36), 8929-8933.
