Figures and data

Neural data acquisition and the acoustic-linguistic dual-pathway framework for neural-driven natural speech re-synthesis.
(A) Neural data acquisition. We collected ECoG data from 9 monolingual native English participants as they each listened to English sentences from the TIMIT corpus, yielding 20 minutes of neural recordings per participant. For each participant, the recorded neural activity was randomly split into training (70%), validation (20%), and test (10%) sets. (B) The acoustic-linguistic dual-pathway framework. The acoustic pathway consists of two stages. Stage 1: a HiFi-GAN generator is pre-trained to synthesize natural speech from features extracted by the frozen Wav2Vec2.0 encoder, using a multi-receptive-field fusion module and adversarial training with discriminators; this stage uses the LibriSpeech corpus to enhance speech representation learning. Stage 2: a lightweight LSTM adaptor maps neural activity to speech representations, enabling the frozen HiFi-GAN generator to re-synthesize speech with high acoustic fidelity from neural data. In the linguistic pathway, a Transformer adaptor refines neural features to align with word tokens, which are fed into the frozen Parler-TTS model to generate highly intelligible speech. The voice-cloning stage uses CosyVoice 2.0 (fine-tuned on TIMIT) to clone the speaker’s voice, ensuring that the final re-synthesized speech waveform matches the original stimuli in clarity and voice characteristics.
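For readers who want a concrete picture of the trial split and the Stage 2 adaptor described above, the minimal sketch below performs a random 70/20/10 split and defines a lightweight LSTM that maps ECoG activity into a speech-representation space consumed by a frozen generator. The channel count, feature dimension, hidden size, and layer count are illustrative assumptions, not the exact configuration used in the study.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical dimensions: 256 ECoG channels, 768-d Wav2Vec2.0-style features.
N_CHANNELS, FEAT_DIM, HIDDEN = 256, 768, 512

def random_split(n_trials, seed=0):
    """Randomly assign trial indices to training/validation/test (70/20/10)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_trials)
    n_train, n_val = int(0.7 * n_trials), int(0.2 * n_trials)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

class LSTMAdaptor(nn.Module):
    """Lightweight LSTM mapping neural activity into a speech-representation space."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(N_CHANNELS, HIDDEN, num_layers=2, batch_first=True)
        self.proj = nn.Linear(HIDDEN, FEAT_DIM)

    def forward(self, ecog):        # ecog: (batch, time, channels)
        h, _ = self.lstm(ecog)
        return self.proj(h)         # (batch, time, FEAT_DIM), fed to the frozen generator

# Example: split 500 trials, then map a dummy ECoG segment to feature space.
train_idx, val_idx, test_idx = random_split(500)
features = LSTMAdaptor()(torch.randn(1, 100, N_CHANNELS))
print(len(train_idx), len(val_idx), len(test_idx), features.shape)
```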

Examples of neural-driven speech reconstruction performance.
(A) Ground-truth and decoded speech comparisons. Waveforms (time vs. amplitude) and mel-spectrograms (time vs. frequency, 0-8 kHz) are shown for the original (top) and reconstructed (bottom) speech samples, demonstrating preserved spectral-temporal patterns in the neural-decoded output. Decoded phoneme and word sequences are shown alongside the speech. (B) Mel-spectrogram correlation analysis. The KDE curve shows the distribution of aligned correlation coefficients between the original and reconstructed mel-spectrograms across all test samples, and the bar chart shows the percentage of trials in the test set (mean = 0.824 ± 0.028). Higher values reflect better preservation of acoustic features in the time-frequency domain. Purple dashed lines (light to dark) show the average R² after adding −10 dB, −5 dB, and 0 dB additive noise to the original speech. (C) Human subject evaluations. The KDE curve displays the distribution of human evaluation results, and the bar chart shows the percentage of trials in the test set (mean = 3.956 ± 0.175). Higher values reflect better intelligibility. Purple dashed lines (light to dark) show the average MOS after adding −20 dB, −10 dB, and 0 dB additive noise to the original speech. (D-E) Word error rate (WER) and phoneme error rate (PER) assessments.
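As a sketch of the kind of mel-spectrogram comparison behind panel (B), the snippet below computes a Pearson correlation between log-mel spectrograms of an original and a reconstructed waveform (80 mel bands, 8 kHz upper bound). The band count, the truncation-based temporal alignment, and the use of librosa are assumptions for illustration; the exact alignment and correlation procedure in the paper may differ.

```python
import numpy as np
import librosa

def mel_correlation(orig, recon, sr=16000):
    """Pearson correlation between log-mel spectrograms (80 bands, fmax = 8 kHz)."""
    def logmel(y):
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, fmax=8000)
        return librosa.power_to_db(mel)
    a, b = logmel(orig), logmel(recon)
    t = min(a.shape[1], b.shape[1])      # crude temporal alignment by truncation
    return np.corrcoef(a[:, :t].ravel(), b[:, :t].ravel())[0, 1]

# Toy example: a synthetic "original" and a lightly perturbed "reconstruction".
sr = 16000
orig = np.random.default_rng(0).standard_normal(3 * sr).astype(np.float32)
recon = orig + 0.1 * np.random.default_rng(1).standard_normal(3 * sr).astype(np.float32)
print(f"mel-spectrogram correlation: {mel_correlation(orig, recon, sr):.3f}")
```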

Performance comparison of the re-synthesized speech with the MLP-regression, acoustic, and linguistic baselines.
(A) Speech waveform and mel-spectrogram observations. Depicted are waveforms (time vs. amplitude) and mel-spectrograms (time vs. frequency, 0-8 kHz) for illustrative speech samples (ground truth; MLP regression and the acoustic- and linguistic-pathway intermediate outputs as baselines 1-3; and our final re-synthesized natural speech). (B) Objective evaluation using mel-spectrogram R². Violin plots show the aggregated distribution of R² scores (0-1 scale) assessing spectral fidelity. The white dot represents the median, the box spans the interquartile range, and whiskers extend to ±1.5×IQR. Three dashed lines (light to dark) show the average R² after adding −10 dB, −5 dB, and 0 dB additive noise to the original speech. (C) Subjective quality evaluation using the mean opinion score (MOS). Violin plots show the aggregated MOS distribution (1-5 scale). Three dashed lines (light to dark) indicate the average MOS after adding −20 dB, −10 dB, and 0 dB additive noise. GT: ground truth. (D) Intelligibility assessment using word error rate (WER). Violin plots show the aggregated WER distribution (0-1 scale). Three purple dashed lines (light to dark) show the average WER after adding −10 dB, −5 dB, and 0 dB additive noise. (E) Phoneme error rate (PER) assessment. Format and noise conditions are identical to panel D. Statistical significance markers: *: p < 0.01, **: p < 0.001, n.s.: not significant.
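The dashed-line reference conditions in panels (B)-(E) correspond to the original speech corrupted with additive noise at fixed dB levels. Below is a minimal sketch of producing such a reference condition, assuming white Gaussian noise and interpreting the dB values as signal-to-noise ratios; the noise type and exact procedure used in the paper are not specified here.

```python
import numpy as np

def add_noise_at_snr(speech, snr_db, rng=None):
    """Add white Gaussian noise so the noisy signal has the requested SNR in dB."""
    rng = rng if rng is not None else np.random.default_rng(0)
    noise = rng.standard_normal(speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_speech / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Reference conditions analogous to the dashed lines (e.g. 0, -5, and -10 dB).
sr = 16000
clean = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # 1 s, 220 Hz tone
for snr in (0, -5, -10):
    noisy = add_noise_at_snr(clean, snr)
    achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
    print(f"target {snr:>4} dB SNR -> achieved {achieved:.2f} dB")
```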

Phoneme recognition performance: confusion matrices and accuracy comparison.
(A) Confusion matrix between phoneme sequences transcribed using our proposed model (horizontal axis) and the ground truth (vertical axis). Diagonal values represent recognition accuracy for each phoneme (correct matches), while off-diagonal non-zero elements indicate substitution errors. Empty rows denote insertion errors (extraneous phonemes), and empty columns indicate deletion errors (missing phonemes). All non-zero elements are highlighted for visual clarity. (B-D) Phoneme confusion matrices using baselines 1-3 for phoneme transcription. (E) Phoneme-class clarity across methods. The violin plots show the proportion of correctly identified phonemes for each symbol. Asterisks (*) indicate phonemes where our proposed framework significantly outperforms the baselines.
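A phoneme confusion matrix of the kind shown in panel (A) can be built by edit-distance alignment of the transcribed and ground-truth phoneme sequences: aligned pairs fill the matrix (matches on the diagonal, substitutions off it), while unaligned phonemes are counted as insertions or deletions. The sketch below illustrates this logic under those assumptions; the phoneme inventory and unit alignment costs are placeholders, not those used in the paper.

```python
import numpy as np

def phoneme_confusion(ref, hyp, phones):
    """Align ground-truth (ref) and transcribed (hyp) phoneme sequences by
    edit distance, then count matches/substitutions plus insertions/deletions."""
    n, m = len(ref), len(hyp)
    d = np.zeros((n + 1, m + 1), dtype=int)
    d[:, 0] = np.arange(n + 1)              # deleting all ref phonemes
    d[0, :] = np.arange(m + 1)              # inserting all hyp phonemes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # match / substitution
    idx = {p: k for k, p in enumerate(phones)}
    conf = np.zeros((len(phones), len(phones)), dtype=int)   # rows: ref, cols: hyp
    insertions = deletions = 0
    i, j = n, m
    while i > 0 or j > 0:                   # backtrace one optimal alignment
        if (i > 0 and j > 0
                and d[i, j] == d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])):
            conf[idx[ref[i - 1]], idx[hyp[j - 1]]] += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i, j] == d[i - 1, j] + 1:
            deletions += 1
            i -= 1
        else:
            insertions += 1
            j -= 1
    return conf, insertions, deletions

# Toy example with a placeholder phoneme inventory.
phones = ["sh", "iy", "hh", "ae", "d"]
conf, ins, dels = phoneme_confusion(["sh", "iy", "hh", "ae", "d"],
                                    ["sh", "iy", "ae", "d", "d"], phones)
print(conf, ins, dels)
```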