Introduction

Semantic representation, distinguished by its cohesive conceptual nature, emerges from distributed modality-specific regions. There is broad consensus on the existence of 'convergence zones' within the temporal and inferior parietal areas1, or of a 'semantic hub' located in the anterior temporal lobe2, that are pivotal for integrating, converging, or distilling multimodal inputs. Contemporary perspectives portray semantic processing as a sequence of quantitatively functional mental states defined by a specific parser3, unified by statistical regularities among multiple sensory inputs4 through hierarchical prediction and multimodal interactions5–9. Hence, it has been proposed that coherent semantic representation emerges from statistical learning mechanisms within these 'convergence zones' or the 'semantic hub'10–12, potentially operating in a graded manner12,13. However, the exact nature of the graded structure within these integration hubs, along with its temporal dynamics, remains elusive.

Among the many kinds of multimodal extralinguistic information, representational gesture is the one most closely related to the semantic content of co-occurring speech14,15. Representational gesture is regarded as 'part of language'16 or as a functional equivalent of lexical units that alternates and integrates with speech in a 'single unification space' to convey a coherent meaning17–19. Empirical studies have investigated the semantic integration between representational gesture (hereafter, gesture) and speech by manipulating their semantic relationship20–23 and have revealed a mutual interaction between the two24–26, as reflected by N400 latency and amplitude19 as well as by common neural underpinnings in the left inferior frontal gyrus (IFG) and posterior middle temporal gyrus (pMTG)20,27,28. By quantifying the amount of information from both sources and their interaction, the present study examined cortical engagement and temporal dynamics during multisensory gesture-speech integration, with a specific focus on the IFG and pMTG, alongside various ERP components.

To this end, we developed an analytic approach to directly probe the contributions of gesture and speech during multisensory semantic integration, adopting the information-theoretic complexity metrics of entropy and mutual information (MI). Entropy captures the disorder or randomness of information and is used as a measure of the uncertainty of the representation activated when an event occurs29. MI quantifies the mutual constraint that two variables impose on each other30. Here, during gesture-speech integration, entropy measures the uncertainty of the information carried by gesture or speech, while MI indexes their degree of integration.
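
For reference, these quantities take their standard Shannon forms; the notation below (X for the gesture response distribution, Y for the speech response distribution, base-2 logarithms, i.e., bits) is ours and corresponds to the formulas applied to the stimuli in Figure 2A:

$$H(X) = -\sum_{i} p(x_i)\log_2 p(x_i), \qquad H(Y) = -\sum_{i} p(y_i)\log_2 p(y_i),$$

$$H(X,Y) = -\sum_{i,j} p(x_i,y_j)\log_2 p(x_i,y_j), \qquad \mathrm{MI}(X;Y) = H(X) + H(Y) - H(X,Y).$$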

Three experiments were conducted to unravel the intricate neural processes underlying gesture-speech semantic integration. In Experiment 1, High-Definition Transcranial Direct Current Stimulation (HD-tDCS) was used to administer Anodal, Cathodal, or Sham stimulation to either the IFG or the pMTG. HD-tDCS induces membrane depolarization with anodal stimulation and membrane hyperpolarization with cathodal stimulation31, thereby respectively increasing or decreasing cortical excitability in the targeted brain area. Experiment 1 therefore aimed to determine whether the facilitation effect (Anodal-tDCS minus Sham-tDCS) and/or the inhibitory effect (Cathodal-tDCS minus Sham-tDCS) on the integration hubs of the IFG and/or pMTG was modulated by the degree of gesture-speech integration, indexed with MI. Considering the different roles of the IFG and pMTG during integration28, as well as the various ERP components reported in prior investigations, such as the early sensory P1 and N1-P2 effects33,34, the N400 semantic conflict effect19,34,35, and the late positive component (LPC) reconstruction effect36,37, Experiment 2 employed chronometric double-pulse transcranial magnetic stimulation (TMS) to target short time windows along the gesture-speech integration period32. In parallel, Experiment 3 used high-temporal-resolution event-related potentials to explore whether the various neural engagements were temporally and progressively modulated by distinct information contributors during gesture-speech integration.

Material and methods

Participants

Ninety-eight young Chinese participants signed written informed consent forms and took part in the present study (Experiment 1: 29 females, 23 males, age = 20 ± 3.40 years; Experiment 2: 11 females, 13 males, age = 23 ± 4.88 years; Experiment 3: 12 females, 10 males, age = 21 ± 3.53 years). All of the participants were right-handed (Experiment 1: laterality quotient (LQ)38 = 88.71 ± 13.14; Experiment 2: LQ = 89.02 ± 13.25; Experiment 3: LQ = 88.49 ± 12.65), had normal or corrected-to-normal vision and were paid ¥100 per hour for their participation. All experiments were approved by the Ethics Committee of the Institute of Psychology, Chinese Academy of Sciences.

Stimuli

Twenty gestures (Appendix Table 1) paired with 20 semantically congruent speech signals, taken from a previous study28, were used. The stimulus set was recorded from two native Chinese speakers (1 male, 1 female) and validated by replicating the semantic congruency effect with 30 participants. Results showed significantly (t(29) = 7.16, p < 0.001) longer reaction times when participants were asked to judge the gender of the speaker if the gesture carried semantic information incongruent with the speech (a 'cut' gesture paired with the spoken word '喷 pen1 (spray)': mean = 554.51 ms, SE = 11.65) than when gesture and speech were semantically congruent (a 'cut' gesture paired with the word '剪 jian3 (cut)': mean = 533.90 ms, SE = 12.02)28.

Additionally, two separate pre-tests with 30 subjects each (pre-test 1: 16 females, 14 males, age = 24 ± 4.37 years; pre-test 2: 15 females, 15 males, age = 22 ± 3.26 years) were conducted to determine when the gestures and speech items could be semantically identified. Participants were presented with segments of increasing duration, beginning at 40 ms, and were prompted to provide a single verb describing either the isolated gesture they observed (pre-test 1) or the isolated speech they heard (pre-test 2). For each pre-test, the response consistently provided across four to six consecutive increments was taken as the identified meaning of the gesture or speech item. The duration of the first increment in that run was marked as the discrimination point (DP) of the gesture (mean = 183.78 ± 84.82 ms) or the identification point (IP) of the speech (mean = 176.40 ± 66.21 ms) (Figure 1A top).
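
For illustration only, a minimal sketch of how such a point could be derived from gating responses is given below; the 40 ms step size of the toy data and the fixed run length of four are assumptions, not details taken from the pre-tests.

```python
# Minimal sketch (not the authors' code): find the first gate duration at which
# the same verb is given for a run of consecutive gates.
def identification_point(durations_ms, responses, run_length=4):
    """durations_ms and responses are parallel lists ordered by gate duration."""
    for start in range(len(responses) - run_length + 1):
        window = responses[start:start + run_length]
        if len(set(window)) == 1:          # same verb across the whole run
            return durations_ms[start]     # duration of the first gate in the run
    return None                            # response never stabilized

# Toy usage: gates grow in 40 ms steps; the response stabilizes from the 160 ms gate.
durations = [40, 80, 120, 160, 200, 240, 280]
verbs = ["?", "wave", "wave", "cut", "cut", "cut", "cut"]
print(identification_point(durations, verbs))   # -> 160
```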

Experimental design and stimulus characteristics.

(A) Experimental stimuli. Twenty gestures were paired with 20 relevant speech stimuli. Two gating studies were conducted to define the minimal length of each gesture and speech item required for semantic identification, namely the discrimination point (DP) of gesture and the identification point (IP) of speech. Overall, the mean gesture DP was 183.78 ms (SD = 84.82) and the mean speech IP was 176.40 ms (SD = 66.21). The onset of speech was set at the gesture DP. Responses for each item were assessed with the information-theoretic complexity metrics of entropy and MI to quantify the information content of both gesture and speech during integration.

(B) Procedure of Experiment 1. HD-tDCS (Anodal, Cathodal, or Sham) was administered to the IFG or the pMTG using a 4 × 1 ring-based electrode montage. Electrode F7 targeted the IFG, with return electrodes placed on AF7, FC5, F9, and FT9. For pMTG stimulation, TP7 was targeted, with return electrodes positioned on C5, P5, T9, and P9. Sessions lasted 20 minutes with a 5-second fade-in and fade-out, while the Sham condition involved only 30 seconds of stimulation.

(C) Procedure of Experiment 2. Eight time windows (TWs, duration = 40 ms) were segmented relative to the speech IP. Among these, five (TW1, TW2, TW3, TW6, and TW7) were chosen based on the significant results of our prior study28. Double-pulse TMS was delivered over either the pMTG or the IFG in each of the selected TWs.

(D) Procedure of Experiment 3. Semantically congruent gesture-speech pairs were presented in random order while the electroencephalogram (EEG) was recorded. Epochs were time-locked to the onset of speech and lasted 1000 ms. A 200 ms pre-stimulus baseline correction was applied before the onset of the gesture stroke. Various elicited ERP components were hypothesized.

(E-F) Proposed gradations in cortical engagement as gesture-speech information changes. Stepwise variations in the quantity of gesture and speech information during integration, as characterized by information-theoretic metrics (E), are proposed to be underpinned by progressive neural engagement within the IFG-pMTG gesture-speech integration circuit (F).

To quantify information content, responses for each item were converted into Shannon's entropy (H) as a measure of information richness (Figure 1A bottom). With no significant gender differences observed for either gesture (t(20) = 0.21, p = 0.84) or speech (t(20) = 0.52, p = 0.61), responses were aggregated across genders, resulting in 60 answers per item (Appendix Table 2). Here, p(xi) and p(yi) represent the distributions of the 60 answers for a given gesture (Appendix Table 2B) and speech (Appendix Table 2A), respectively. High entropy indicates diverse answers, reflecting a broad representation, while low entropy suggests focused lexical recognition of a specific item (Figure 2B). The joint entropy of gesture and speech, H(xi, yi), was computed by amalgamating the gesture and speech response datasets to describe their combined distribution. For a given gesture-speech combination, equivalence between the joint entropy and the sum of the individual gesture and speech entropies indicates an absence of overlap between the response sets. Conversely, substantial overlap, that is, a considerable number of responses shared between the gesture and speech datasets, produces a noticeable discrepancy between the joint entropy and the sum of the gesture and speech entropies. This gesture-speech overlap was therefore operationalized as mutual information (MI), computed by subtracting the gesture-speech joint entropy from the combined gesture and speech entropies (see Appendix Table 2C). Elevated MI values thus signify substantial overlap, indicative of a robust mutual interaction between gesture and speech. The quantitative information for each stimulus, including gesture entropy, speech entropy, joint entropy, and MI, is displayed in Appendix Table 3.
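
As a concrete illustration (not the authors' code), the per-item quantities can be computed from the response lists roughly as follows; how the "amalgamated" joint distribution is assembled is our reading of the text and is labelled as such in the comments.

```python
# Minimal sketch: per-item entropy and MI from verb responses (toy data below).
# Assumes each item has 60 gesture responses and 60 speech responses pooled
# across speaker genders, as described above. Treating the pooled gesture+speech
# response list as the "amalgamated" joint distribution is an assumption.
from collections import Counter
import math

def entropy(responses):
    """Shannon entropy (bits) of a list of labelled responses."""
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in Counter(responses).values())

def mutual_information(gesture_resp, speech_resp, joint_resp):
    """MI = H(gesture) + H(speech) - H(gesture, speech)."""
    return entropy(gesture_resp) + entropy(speech_resp) - entropy(joint_resp)

# Toy example: the more the two response sets overlap, the higher the MI.
gesture = ["cut"] * 40 + ["snip"] * 20        # hypothetical 60 gesture answers
speech = ["cut"] * 50 + ["trim"] * 10         # hypothetical 60 speech answers
joint = gesture + speech                      # one reading of "amalgamating" the sets
print(entropy(gesture), entropy(speech), mutual_information(gesture, speech, joint))
```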

Quantification formulas (A) and Shannon entropy distributions of each stimulus (B).

Two separate pre-tests (N = 30 each) were conducted in which participants assigned a single verb to describe each of the 20 isolated gestures and 20 isolated speech items. The responses provided for each item were transformed into Shannon's entropy using the corresponding quantification formula. Gesture (B left) and speech (B right) entropy quantify the randomness of the gestural or speech information, representing the uncertainty of the probabilistic representation activated when a specific stimulus occurs. Joint entropy (B middle) captures the spread of the two sources of information combined. Mutual information (MI) was calculated as the difference between the joint entropy and the combined gesture and speech entropies (A), thereby capturing the overlap of gesture and speech and representing semantic integration.

To accurately assess whether entropy/MI corresponds to stepped neural changes, the current study averaged neural responses (non-invasive brain stimulation (NIBS) inhibition effects or ERP amplitudes) across stimuli with identical entropy or MI values prior to conducting correlational analyses.
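
A minimal sketch of this aggregation step follows; the column names and data are illustrative, not the authors'.

```python
# Average the neural response (e.g., an NIBS effect or ERP amplitude) across
# stimuli sharing the same information value, then correlate the per-level means
# with the information values themselves.
import pandas as pd
from scipy.stats import pearsonr

def correlate_binned(items, info_col="MI", effect_col="neural_effect"):
    binned = items.groupby(info_col, as_index=False)[effect_col].mean()
    return pearsonr(binned[info_col], binned[effect_col])

# Hypothetical per-stimulus table: two items share MI = 0.8, two share MI = 1.5.
items = pd.DataFrame({"MI": [0.8, 0.8, 1.2, 1.5, 1.5],
                      "neural_effect": [-3.0, -4.0, -6.0, -8.0, -9.0]})
r, p = correlate_binned(items)
```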

Experimental procedure

In a semantic priming paradigm in which gesture primes speech16,32, speech onset was set at the DP of each accompanying gesture. An irrelevant factor of gender congruency (e.g., a man making a gesture combined with a female voice) was created27,28,39 by aligning the gender of the voice with the gender of the gesturer in either a congruent (e.g., male voice paired with a male gesture) or an incongruent (e.g., male voice paired with a female gesture) manner. This factor served as a direct control task, facilitating the investigation of the automatic and implicit semantic interplay between gesture and speech39. In light of previous findings indicating a distinct TMS-disruption effect on the semantic congruency of gesture-speech interactions28, both semantically congruent and incongruent pairs were included in Experiments 1 and 2. Experiment 3, by contrast, used only semantically congruent pairs to elucidate ERP metrics indicative of nuanced semantic progression.

Gesture-speech pairs were presented in random order using Presentation software (www.neurobs.com). Experiment 1, comprising 480 gesture-speech pairs, was completed in three separate sessions spaced one week apart; in each session, participants received one of the three stimulation types (Anodal, Cathodal, or Sham). Experiment 2 consisted of 800 pairs and was conducted across 15 blocks over three days, with one week between sessions; the order of stimulation site and time window (TW) was counterbalanced using a Latin square design. Experiment 3, comprising 80 gesture-speech pairs, was completed in a single-day session. Participants were asked to watch the screen but to respond with both hands, as quickly and accurately as possible, only to the gender of the voice they heard. The reaction time (RT) and the button pressed were recorded. Each trial began with a fixation cross presented at the center of the screen for 0.5-1.5 s.

Experiment 1: HD-tDCS protocol and data analysis

The HD-tDCS protocol employed a constant-current stimulator (Starstim 8 system) delivering stimulation at an intensity of 2 mA. A 4 × 1 ring-based electrode montage was used, comprising a central (stimulation) electrode positioned directly over the target cortical area and four return electrodes encircling it to provide focal stimulation. For targeting the left IFG at Montreal Neurological Institute (MNI) coordinates (-62, 16, 22), electrode F7 was selected as the optimal cortical projection site40, with the four return electrodes placed on AF7, FC5, F9, and FT9. For stimulation of the pMTG at coordinates (-50, -56, 10), TP7 was identified as the cortical projection site40, with return electrodes positioned on C5, P5, T9, and P9. Stimulation lasted 20 minutes with a 5-second fade-in and fade-out for both the Anodal and Cathodal conditions. The Sham condition involved a 5-second fade-in followed by only 30 seconds of stimulation, then 19 minutes and 20 seconds of no stimulation, and finally a 5-second fade-out (Figure 1B). Stimulation was controlled using NIC software, with participants blinded to the stimulation conditions.

All incorrect responses (702 of 24,960 trials, 2.81%) were excluded. To eliminate the influence of outliers, a 2 SD trimmed mean was calculated for every participant in each session. Our analysis focused on Pearson correlations between the interruption effect of HD-tDCS (active tDCS minus sham tDCS) on the semantic congruency effect (the difference in reaction time between semantically incongruent and semantically congruent pairs) and gesture entropy, speech entropy, or MI. This approach seeks to determine whether neural activity within the left IFG and pMTG is gradually affected by varying levels of gesture and speech information during integration, as quantified by entropy and MI.
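
As an illustration of these preprocessing and effect definitions, a minimal sketch is given below; the function names are ours, and how trimming interacts with conditions within a session is simplified.

```python
# Sketch of the RT preprocessing and effect definitions described above.
import numpy as np

def trimmed_mean_2sd(rt):
    """Mean RT after removing values beyond mean +/- 2 SD (correct trials only)."""
    rt = np.asarray(rt, dtype=float)
    m, s = rt.mean(), rt.std()
    return rt[(rt >= m - 2 * s) & (rt <= m + 2 * s)].mean()

def congruency_effect(rt_incongruent, rt_congruent):
    """Semantic congruency effect: incongruent minus congruent RT."""
    return trimmed_mean_2sd(rt_incongruent) - trimmed_mean_2sd(rt_congruent)

def tdcs_effect(act_incong, act_cong, sham_incong, sham_cong):
    """HD-tDCS interruption effect: active minus sham, each as a congruency effect."""
    return congruency_effect(act_incong, act_cong) - congruency_effect(sham_incong, sham_cong)
```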

Experiment 2: TMS protocol and data analysis

At an intensity of 50% of the maximum stimulator output, double-pulse TMS was delivered via a 70 mm figure-eight coil using a Magstim Rapid² stimulator (Magstim, UK) over either the left IFG in TW3 (-40 to 0 ms relative to the speech identification point (IP)) and TW6 (80 to 120 ms), or the left pMTG in TW1 (-120 to -80 ms), TW2 (-80 to -40 ms), and TW7 (120 to 160 ms). Among the TWs covering the period of gesture-speech integration, those that had shown a TW-selective disruption of gesture-speech integration in our prior study28 were selected (Figure 1C).

High-resolution (1 × 1 × 0.6 mm) T1-weighted MRI scans were obtained using a Siemens 3T Trio/Tim scanner for image-guided TMS navigation. Frameless stereotaxic procedures (BrainSight 2; Rogue Research) allowed real-time monitoring of stimulation. To ensure precision, individual anatomical images were manually registered by identifying the anterior and posterior commissures. Subject-specific target regions were defined using trajectory markers in the MNI coordinate system. The vertex was used as a control site.

All incorrect responses (922 of 19,200 trials, 4.8%) were excluded. We focused our analysis on Pearson correlations of the TMS interruption effect (active TMS minus vertex TMS) on the semantic congruency effect with gesture entropy, speech entropy, or MI. This allowed us to determine how the time-sensitive contributions of the left IFG and pMTG to gesture-speech integration were affected by the distribution of gesture and speech information. FDR correction was applied for multiple comparisons.
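
For illustration, the FDR step can be implemented with a standard Benjamini-Hochberg correction; the p-values below are those of the TW2/TW6 correlations reported in the Results and are used here only as an example input.

```python
# Benjamini-Hochberg FDR correction over a set of correlation p-values.
from statsmodels.stats.multitest import multipletests

p_values = [0.004, 0.014, 0.026, 0.001]   # example: TW2 speech entropy; TW6 gesture entropy, speech entropy, MI
reject, p_fdr, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(zip(p_fdr.round(3), reject)))
```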

Experiment 3: Electroencephalogram (EEG) recording and data analysis

EEG was recorded from 48 Ag/AgCl electrodes mounted in a cap according to the 10-20 system41, amplified with a PORTI-32/MREFA amplifier (TMS International B.V., Enschede, NL), and digitized online at 500 Hz (bandpass, 0.01-70 Hz). EEGLAB, a MATLAB toolbox, was used to analyze the EEG data42. Vertical and horizontal eye movements were measured with four electrodes placed above the left eyebrow, below the left orbital ridge, and at the bilateral external canthi. All electrodes were referenced online to the left mastoid; the average of the left and right mastoids was used for re-referencing offline. Electrode impedance was maintained below 5 kΩ. A high-pass filter with a cutoff of 0.05 Hz and a low-pass filter with a cutoff of 30 Hz were applied. Semi-automated artifact removal, including independent component analysis (ICA) to identify components reflecting eye blinks and muscle activity, was performed (Figure 1D). Participants with more than 30% of trials rejected were excluded from further analysis.
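
The preprocessing was carried out in EEGLAB; purely as an illustration of the stated parameters, a rough analogue in MNE-Python might look like the sketch below (the file name, mastoid channel labels, component indices, and event coding are assumptions).

```python
# Illustrative MNE-Python analogue of the EEGLAB pipeline described above.
import mne

raw = mne.io.read_raw_brainvision("sub-01.vhdr", preload=True)   # hypothetical recording
raw.filter(l_freq=0.05, h_freq=30.0)          # 0.05 Hz high-pass, 30 Hz low-pass
raw.set_eeg_reference(["M1", "M2"])           # re-reference to the mastoid average (assumed labels)

ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw)
ica.exclude = [0, 1]                          # blink/muscle components chosen by inspection
ica.apply(raw)

events, _ = mne.events_from_annotations(raw)
# Epochs time-locked to speech onset and lasting 1000 ms; the published baseline
# (200 ms before gesture-stroke onset) is only approximated here.
epochs = mne.Epochs(raw, events, tmin=-0.2, tmax=1.0, baseline=(-0.2, 0.0), preload=True)
```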

All incorrect responses (147 of 1,760 trials, 8.35%) were excluded. To eliminate the influence of outliers, a 2 SD trimmed mean was calculated for every participant in each condition. Data were epoched from the onset of speech and lasted 1000 ms. To ensure a clean baseline with no stimulus present, a 200 ms pre-stimulus baseline correction was applied before gesture onset.

To objectively identify the time windows of the activated components, grand-average ERPs at electrode Cz were compared between the higher (≥50%) and lower (<50%) halves of gesture entropy (Figure 5A1), speech entropy (Figure 5B1), and MI (Figure 5C1). Consequently, four ERP components were predetermined: the P1 effect within 0-100 ms33,34, the N1-P2 effect within 150-250 ms33,34, the N400 within 250-450 ms19,34,35, and the LPC spanning 550-1000 ms36,37. Additionally, seven regions of interest (ROIs) were defined to localize the modulation effect on each ERP component: left anterior (LA): F1, F3, F5, FC1, FC3, and FC5; left central (LC): C1, C3, C5, CP1, CP3, and CP5; left posterior (LP): P1, P3, P5, PO3, PO5, and O1; right anterior (RA): F2, F4, F6, FC2, FC4, and FC6; right central (RC): C2, C4, C6, CP2, CP4, and CP6; right posterior (RP): P2, P4, P6, PO4, PO6, and O2; and midline electrodes (ML): Fz, FCz, Cz, Pz, Oz, and CPz43.

Subsequently, cluster-based permutation tests44 in FieldTrip were used to identify significant clusters of adjacent time points and electrodes in which ERP amplitude differed between the higher and lower halves of gesture entropy, speech entropy, and MI, respectively. The electrode-level type I error threshold was set to 0.025. The cluster-level statistic, defined as the sum of the t-values of the samples within a cluster, was evaluated against a null distribution estimated from 5000 Monte Carlo permutations. The cluster-level type I error threshold was set to 0.05; clusters with a p-value below this critical alpha level were considered to differ between conditions.
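
The cluster statistics were computed in FieldTrip; as an illustration only, an analogous paired (lower vs. higher half) test in MNE-Python, using the thresholds stated above and placeholder data, could look like the following sketch.

```python
# Illustrative MNE-Python analogue of the FieldTrip cluster-based permutation test.
# `epochs` refers to the Epochs object from the preprocessing sketch above; the
# ERP difference array here is a random placeholder, not real data.
import numpy as np
from scipy import stats
import mne
from mne.stats import spatio_temporal_cluster_1samp_test

n_subjects, n_times, n_channels = 22, 500, 48
diff = np.random.randn(n_subjects, n_times, n_channels)   # higher-half minus lower-half ERPs

adjacency, _ = mne.channels.find_ch_adjacency(epochs.info, ch_type="eeg")
t_thresh = stats.t.ppf(1 - 0.025, df=n_subjects - 1)      # electrode-level alpha = 0.025
t_obs, clusters, cluster_pv, _ = spatio_temporal_cluster_1samp_test(
    diff, threshold=t_thresh, n_permutations=5000, adjacency=adjacency, tail=0)
significant = [c for c, p in zip(clusters, cluster_pv) if p < 0.05]   # cluster alpha = 0.05
```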

Paired t-tests were conducted to compare the lower and upper halves of each information model for the averaged amplitude within each ROI or cluster across the four ERP time windows, separately. Pearson correlations were calculated between each model value and each averaged ERP amplitude in each ROI or cluster, individually. False discovery rate (FDR) correction was applied for multiple comparisons.

Results

Experiment 1: Modulation of left pMTG and IFG engagement by gradual changes in gesture-speech semantic information

In the IFG, a one-way ANOVA examining the effects of the three tDCS conditions (Anodal, Cathodal, or Sham) on the semantic congruency effect (RT(semantically incongruent) - RT(semantically congruent)) demonstrated a significant main effect of stimulation condition (F(2, 75) = 3.673, p = 0.030, ηp2 = 0.089). Post hoc paired t-tests indicated a significantly reduced semantic congruency effect in the Cathodal condition relative to the Sham condition (t(26) = -3.296, p = 0.003, 95% CI = [-11.488, 4.896]) (Figure 3A left). Subsequent Pearson correlation analysis revealed that this reduction was progressively associated with MI, as evidenced by a significant correlation between the Cathodal-tDCS effect (Cathodal-tDCS minus Sham-tDCS) and MI (r = -0.595, p = 0.007, 95% CI = [-0.995, -0.195]) (Figure 3B).

tDCS effect over semantic congruency.

(A) tDCS effect was defined as active-tDCS minus sham-tDCS. The semantic congruency effect was calculated as the reaction time (RT) difference between semantically incongruent and semantically congruent pairs.

(B) Correlations of the tDCS effect over the semantic congruency effect with three information models (gesture entropy, speech entropy and MI) are displayed with best-fitting regression lines. Significant correlations are marked in red. * p < 0.05, ** p < 0.01 after FDR correction.

Similarly, in the pMTG, a one-way ANOVA assessing the effects of the three tDCS conditions on the semantic congruency effect also revealed a significant main effect of stimulation condition (F(2, 75) = 3.250, p = 0.044, ηp2 = 0.080). Subsequent paired t-tests identified a significantly reduced semantic congruency effect in the Cathodal condition relative to the Sham condition (t(25) = -2.740, p = 0.011, 95% CI = [-11.915, 6.435]) (Figure 3A right). Moreover, a significant correlation was observed between the Cathodal-tDCS effect and MI (r = -0.457, p = 0.049, 95% CI = [-0.900, -0.014]) (Figure 3B). RTs of congruent and incongruent trials for the IFG and pMTG in each stimulation condition are shown in Appendix Table 4A.

Experiment 2: Time-sensitive modulation of left pMTG and IFG engagements by gradual changes in gesture-speech semantic information

A 2 (TMS: active vs. vertex) × 5 (TW) ANOVA on the semantic congruency effect revealed a significant interaction between TMS and TW (F(3.589, 82.538) = 3.273, p = 0.019, ηp2 = 0.125). Further t-tests identified a significant TMS effect over the pMTG in TW1 (t(23) = -3.068, p = 0.005, 95% CI = [-6.838, 0.702]), TW2 (t(23) = -2.923, p = 0.008, 95% CI = [-6.490, 0.644]), and TW7 (t(23) = -2.005, p = 0.047, 95% CI = [-5.628, 1.618]). In contrast, a significant TMS effect over the IFG was found in TW3 (t(23) = -2.335, p = 0.029, 95% CI = [-5.928, 1.258]) and TW6 (t(23) = -4.839, p < 0.001, 95% CI = [-7.617, -2.061]) (Figure 4A). Raw RTs of congruent and incongruent trials are shown in Appendix Table 4B.

TMS effect over semantic congruency.

(A) TMS effect was defined as active-TMS minus vertex-TMS. The semantic congruency effect was calculated as the reaction time (RT) difference between semantically incongruent and semantically congruent pairs.

(B) Correlations of the TMS effect over the semantic congruency effect with three information models (gesture entropy, speech entropy and MI) are displayed with best-fitting regression lines. Significant correlations are marked in red. * p < 0.05, ** p < 0.01, *** p < 0.001 after FDR correction.

Additionally, a significant negative correlation was found between the TMS effect (a more negative TMS effect represents a stronger interruption of the integration effect) and speech entropy when the pMTG was inhibited in TW2 (r = -0.792, p = 0.004, 95% CI = [-1.252, -0.331]). Meanwhile, when the IFG activity was interrupted in TW6, a significant negative correlation was found between the TMS effect and gesture entropy (r = -0.539, p = 0.014, 95% CI = [-0.956, -0.122]), speech entropy (r = -0.664, p = 0.026, 95% CI = [-1.255, -0.073]), and MI (r = -0.677, p = 0.001, 95% CI = [-1.054, -0.300]) (Figure 4B).

Experiment 3: Temporal modulation of P1, N1-P2, N400 and LPC components by gradual changes in gesture-speech semantic information

Topographical maps illustrating amplitude differences between the lower and higher halves of speech entropy demonstrated a central-posterior P1 effect (0-100 ms, Figure 5B2 middle). In line with prior findings33, paired t-tests demonstrated a significantly larger P1 amplitude within the ML ROI (t(22) = 2.510, p = 0.020, 95% confidence interval (CI) = [1.66, 3.36]) when contrasting stimuli in the higher 50% of speech entropy against those in the lower 50% (Figure 5B2 left). Subsequent correlation analyses revealed a significant increase in P1 amplitude with rising speech entropy within the ML ROI (r = 0.609, p = 0.047, 95% CI = [0.039, 1.179], Figure 5B2 right). Furthermore, a cluster of neighboring time-electrode samples exhibited a significant contrast between the lower 50% and higher 50% of speech entropy, revealing a P1 effect spanning 16 to 78 ms at electrodes FC2, FCz, C1, C2, Cz, and CPz (Figure 5B3 middle) (t(22) = 2.754, p = 0.004, 95% CI = [1.65, 3.86], Figure 5B3 left), with a significant correlation with speech entropy (r = 0.636, p = 0.035, 95% CI = [0.081, 1.191], Figure 5B3 right).

ERP results of gesture entropy (A), speech entropy (B) or MI (C).

Four ERP components were identified from grand-average ERPs at the Cz electrode, contrasting trials in the lower 50% (red lines) and the higher 50% (blue lines) of gesture entropy, speech entropy, or MI (top panels). Clusters of adjacent time points and electrodes were subsequently identified within each component using a cluster-based permutation test (bottom right). Topographical maps depict amplitude differences between the lower and higher halves of each information model, with significant ROIs or electrode clusters highlighted in black; solid rectangles delineate the ROIs that exhibited the maximal correlation and paired t-values (bottom left). T-test comparisons (with normal distribution lines) and correlations (with best-fitting regression lines) are illustrated between the average ERP amplitude within the rectangular ROI (bottom left) or the elicited clusters (bottom right) and the three information models individually. * p < 0.05, ** p < 0.01 after FDR correction.

Additionally, topographical maps comparing the lower 50% and higher 50% of gesture entropy revealed a frontal N1-P2 amplitude difference (150-250 ms, Figure 5A2 middle). In accordance with previous findings on bilateral frontal N1-P2 amplitude33, paired t-tests showed a significantly larger amplitude for stimuli in the lower 50% of gesture entropy than in the higher 50% in both the LA (t(22) = 2.820, p = 0.011, 95% CI = [2.21, 3.43]) and RA (t(22) = 2.223, p = 0.038, 95% CI = [1.56, 2.89]) ROIs (Figure 5A2 left). Moreover, a negative correlation was found between N1-P2 amplitude and gesture entropy in both the LA (r = -0.465, p = 0.039, 95% CI = [-0.87, -0.06]) and RA (r = -0.465, p = 0.039, 95% CI = [-0.88, -0.05]) ROIs (Figure 5A2 right). Additionally, a cluster-based permutation test identified the N1-P2 effect from 184 to 202 ms at electrodes FC4, FC6, C2, C4, C6, and CP4 (Figure 5A3 middle) (t(22) = 2.638, p = 0.015, 95% CI = [1.79, 3.48], Figure 5A3 left), which exhibited a significant correlation with gesture entropy (r = -0.485, p = 0.030, 95% CI = [-0.91, -0.06], Figure 5A3 right).

Furthermore, in line with prior research45, a left-frontal N400 amplitude (250-450 ms) was discerned from topographical maps of both gesture entropy (Figure 5A4 middle) and MI (Figure 5C2 middle). Notably, a larger N400 amplitude in the LA ROI was consistently observed for stimuli with lower 50% values compared to those with higher 50% values, both for gesture entropy (t(22) = 2.455, p = 0.023, 95% CI = [1.95, 2.96], Figure 5A4 left) and MI (t(22) = 3.00, p = 0.007, 95% CI = [2.54, 3.46], Figure 5C2 left). Concurrently, a negative correlation was noted between the N400 amplitude and both gesture entropy (r = -0.480, p = 0.032, 95% CI = [-0.94, -0.03], Figure 5A4 right) and MI (r = -0.504, p = 0.028, 95% CI = [-0.97, -0.04], Figure 5C2 right) in the LA ROI.

The cluster exhibiting the N400 effect for gesture entropy (282-318 ms at electrodes FC1, FCz, C1, and Cz, Figure 5A5 middle) (t(22) = 2.828, p = 0.010, 95% CI = [2.02, 3.64], Figure 5A5 left) showed a significant correlation between N400 amplitude and gesture entropy (r = -0.445, p = 0.049, 95% CI = [-0.88, -0.01], Figure 5A5 right). Similarly, the cluster exhibiting the N400 effect for MI (294-306 ms at electrodes F1, F3, Fz, FC1, FC3, FCz, and C1, Figure 5C3 middle) (t(22) = 2.461, p = 0.023, 95% CI = [1.62, 3.30], Figure 5C3 left) also showed a significant correlation with MI (r = -0.569, p = 0.011, 95% CI = [-0.98, -0.16], Figure 5C5 right).

Finally, consistent with previous findings33, an anterior LPC effect (550-1000 ms) was observed in topographical maps comparing stimuli with lower and higher 50% speech entropy (Figure 5B4 middle). The reduced LPC amplitude was evident in the paired t-tests conducted in ROIs of LA (t(22) = 2.614, p = 0.016, 95% CI = [1.88, 3.35]); LC (t(22) = 2.592, p = 0.017, 95% CI = [1.83, 3.35]); RA (t(22) = 2.520, p = 0.020, 95% CI = [1.84, 3.24]); and ML (t(22) = 2.267, p = 0.034, 95% CI = [1.44, 3.10]) (Figure 5B4 left). Simultaneously, a marked negative correlation with speech entropy was evidenced in ROIs of LA (r = -0.836, p = 0.001, 95% CI = [-1.26, -0.42]); LC (r = -0.762, p = 0.006, 95% CI = [-1.23, -0.30]); RA (r = -0.774, p = 0.005, 95% CI = [-1.23, -0.32]) and ML (r = -0.730, p = 0.011, 95% CI = [-1.22, -0.24]) (Figure 5B4 right). Additionally, a cluster with the LPC effect (644 - 688 ms at electrodes Cz, CPz, P1, and Pz, Figure 5B5 middle) (t(22) = 2.754, p = 0.012, 95% CI = [1.50, 4.01], Figure 5B5 left) displayed a significant correlation with speech entropy (r = -0.699, p = 0.017, 95% CI = [-1.24, -0.16], Figure 5B5 right).

Discussion

Through mathematical quantification of gesture and speech information using entropy and mutual information (MI), we examined the functional pattern and dynamic neural architecture underlying multisensory semantic integration. Our results, for the first time, unveiled a progressive inhibition of the IFG and pMTG by HD-tDCS as the degree of gesture-speech interaction, indexed by MI, advanced (Experiment 1). Additionally, this gradual neural engagement was found to be time-sensitive and staged, as evidenced by the selectively interrupted time windows (Experiment 2) and the distinct correlated ERP components (Experiment 3), both modulated by top-down gesture constraint (gesture entropy) and bottom-up speech encoding (speech entropy). These findings significantly expand our understanding of the cortical foundations of statistically regularized multisensory semantic information.

It is widely acknowledged that a single, amodal system mediates the interactions among perceptual representations of different modalities11,12,46. Moreover, observations suggest that semantic dementia patients experience increasing overregularization of their conceptual knowledge as this amodal system progressively deteriorates47. Consequently, a graded function and structure of the transmodal 'hub' representational system have been proposed12,48,49. In line with this, using the NIBS techniques of HD-tDCS and TMS, the present study provides compelling evidence that the integration hubs of gesture and speech, namely the pMTG and IFG, function in a graded manner. This is supported by the progressive inhibition effect observed in these brain areas as the entropy and mutual information of gesture and speech advance.

Moreover, by dividing the potential integration period into eight TWs relative to the speech IP and administering inhibitory double-pulse TMS in each TW, the current study attributed the gradual TMS-selective regional inhibition to distinct information sources. In the early, pre-lexical TW2 of gesture-speech integration, the suppression effect observed in the pMTG was correlated with speech entropy. Conversely, in the later, post-lexical TW6, the IFG interruption effect was influenced by gesture entropy, speech entropy, and their MI. A dual-stage pMTG-IFG-pMTG neurocircuit loop during gesture-speech integration has been proposed previously28. Extending this account, the present study unveils a staged accumulation of engagement within the neurocircuit linking the transmodal regions of pMTG and IFG, arising from distinct information contributors.

Furthermore, we disentangled the sub-processes of integration with high-temporal-resolution ERPs as the representations of gesture and speech were variously presented. Early P1-N1 and P2 sensory effects linked to perception and attentional processes34,50 have been interpreted as reflecting early audiovisual gesture-speech integration in the sensory-perceptual processing chain51. Note that a semantic priming paradigm was adopted here to create a top-down prediction of gesture over speech. The observed positive correlation of the P1 effect with speech entropy and the negative correlation of the N1-P2 effect with gesture entropy suggest that the early interaction of gesture-speech information was modulated by both top-down gesture prediction and bottom-up speech processing. Additionally, the lexico-semantic effects of the N400 and the LPC were differentially mediated by top-down gesture prediction, bottom-up speech encoding, and their interaction: the N400 was negatively correlated with both gesture entropy and MI, whereas the LPC was negatively correlated only with speech entropy. Nonetheless, the activation of representations is modulated progressively: the input stimuli activate a dynamically distributed neural landscape whose state builds up gradually, as measured by entropy and MI, and correlates with the electrophysiological signals (N400 and LPC) that index changes in lexical representation. Consistent with recent accounts of multisensory information processing4,52, our findings further confirm that this changing activation pattern can be driven by both top-down and bottom-up gesture-speech processing.

Considering the close alignment of the ERP components with the TWs of the TMS effects, it is reasonable to tentatively map the ERP components onto the corresponding cortical involvements (Figure 6). Accordingly, referencing the recurrent neurocircuit connecting the left IFG and pMTG for semantic unification53, we extended the previously proposed two-stage gesture-speech integration circuit28 into sequential steps. First, bottom-up speech processing mapping the acoustic signal onto its lexical representation proceeds from the STG/S to the pMTG. The larger the speech entropy, the greater the effort required to match the acoustic input with its stored lexical representation, leading to larger involvement of the pMTG at the pre-lexical stage (TW2) and a larger P1 effect (Figure 6 ①). Second, the gesture representation is activated in the pMTG and exerts a top-down modulation over the phonological processing of speech in the STG/S54. The higher the certainty of the gesture (i.e., the smaller the gesture entropy), the larger its modulation upon speech, as indexed by an enhanced N1-P2 amplitude (Figure 6 ②). Third, information is relayed from the pMTG to the IFG for sustained activation, during which gesture imposes a semantic constraint on the semantic retrieval of speech. A greater TMS effect over the IFG at the post-lexical stage (TW6), accompanied by a reduced N400 amplitude, was found as gesture entropy increased, that is, when the representation of gesture was widely distributed and its constraint over the following speech was weak (Figure 6 ③). Fourth, the activated speech representation is compared with that of the gesture in the IFG. At this stage, the larger the overlap of the neural populations activated by gesture and speech (as indexed by a larger MI), the greater the TMS disruption effect over the IFG and the more reduced the N400 amplitude, reflecting easier integration and less semantic conflict (Figure 6 ④). Last, the activated speech representation disambiguates and reanalyzes the semantic information stored in the IFG and is further unified into a coherent comprehension in the pMTG17,55. The more uncertain the information provided by speech (as indicated by increased speech entropy), the stronger the reweighting of the activated semantic information, resulting in strengthened involvement of the IFG as well as a reduced LPC amplitude (Figure 6 ⑤).

Progressive processing stages of gesture–speech information within the pMTG-IFG loop.

Correlations of the TMS disruption effects of the pMTG and the IFG with the three information models are represented by the orange and green lines, respectively. Black lines denote the strongest correlations of ROI-averaged ERP components with the three information models. * p < 0.05, ** p < 0.01 after FDR correction.

Note that the sequential cortical involvements and ERP components discussed above derive from a deliberate alignment of speech onset with the gesture DP, creating an artificial priming effect in which gesture semantically precedes speech. Caution is therefore advised when generalizing these findings to spontaneous gesture-speech relationships, although gestures do naturally precede speech56.

Limitations exist. ERP components and cortical engagements were linked only through the intermediary variables of entropy and MI, and dissociations were observed between ERP components and cortical engagement. Importantly, there is no direct evidence for the brain structures underpinning the corresponding ERPs, which future studies will need to clarify. Additionally, not all affected TWs exhibited significant associations with entropy and MI. While HD-tDCS and TMS may also impact functionally and anatomically connected brain regions43,44, graded functionality of every disturbed period is not guaranteed. Caution is therefore warranted in interpreting the causal relationship between NIBS inhibition effects and the information-theoretic metrics (entropy and MI). Finally, the current study incorporated a restricted set of entropy and MI measures; the generalizability of the findings should be assessed in future studies using a more extensive range of metrics.

In summary, utilizing the information-theoretic complexity metrics of entropy and mutual information (MI), our study demonstrates that multisensory semantic processing involving gesture and speech gives rise to dynamically evolving representations through the interplay between gesture-primed prediction and speech presentation. This process correlates with the progressive engagement of the pMTG-IFG-pMTG circuit and with various ERP components. These findings significantly advance our understanding of the neural mechanisms underlying multisensory semantic integration.

Acknowledgements

This research was supported by grants from the STI 2030—Major Projects 2021ZD0201500, the National Natural Science Foundation of China (31822024, 31800964), the Scientific Foundation of Institute of Psychology, Chinese Academy of Sciences (E2CX3625CX), and the Strategic Priority Research Program of Chinese Academy of Sciences (XDB32010300).

Additional information

Author contributions

Conceptualization, W.Y.Z. and Y.D.; Investigation, W.Y.Z. and Z.Y.L.; Formal Analysis, W.Y.Z. and Z.Y.L.; Methodology, W.Y.Z. and Z.Y.L.; Validation, Z.Y.L. and X.L.; Visualization, W.Y.Z. and Z.Y.L. and X.L.; Funding Acquisition, W.Y.Z. and Y.D.; Supervision, Y.D.; Project administration, Y.D.; Writing – Original Draft, W.Y.Z.; Writing – Review & Editing, W.Y.Z., Z.Y.L., X.L., and Y.D.

Competing interests

The authors declare no competing interests.

Gesture description and pairing with incongruent and congruent speech.

Examples of ‘an4 (press)’ for the calculation of speech entropy, gesture entropy and mutual information (MI)

Quantitative information for each stimulus.

Raw RT of semantic congruent (Sc) and semantic incongruent (Si) in Experiment 1 and Experiment 2.