High-fidelity neural speech reconstruction through an efficient acoustic-linguistic dual-pathway framework

  1. Jiawei Li
  2. Chunxu Guo
  3. Chao Zhang
  4. Edward F Chang
  5. Yuanning Li (corresponding author)
  1. School of Biomedical Engineering, ShanghaiTech University, China
  2. State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, China
  3. Department of Electronic Engineering, Tsinghua University, China
  4. Shanghai Artificial Intelligence Laboratory, China
  5. Department of Neurological Surgery, University of California, San Francisco, United States
  6. Shanghai Clinical Research and Trial Center, China
  7. Lin Gang Laboratory, China
4 figures and 5 additional files

Figures

Figure 1 with 1 supplement
The neural data acquisition and acoustic-linguistic dual-pathway framework for neural-driven natural speech re-synthesis.

(A) Neural data acquisition. We collected electrocorticography (ECoG) data from nine monolingual native English participants as they each listened to English sentences from the TIMIT corpus, yielding 20 min of neural recordings per participant. For each participant, the recorded neural activity was randomly split into training (70%), validation (20%), and test (10%) sets. (B) The acoustic-linguistic dual-pathway framework. The acoustic pathway consists of two stages. Stage 1: a high-fidelity generative adversarial network (HiFi-GAN) generator is pre-trained to synthesize natural speech from features extracted by the frozen Wav2Vec2.0 encoder, using a multi-receptive field fusion module and adversarial training with discriminators; this stage uses the LibriSpeech corpus to enhance speech representation learning. Stage 2: a lightweight long short-term memory (LSTM) adaptor maps neural activity to speech representations, enabling the frozen HiFi-GAN generator to re-synthesize speech with high acoustic fidelity from neural data. In the linguistic pathway, a Transformer adaptor refines neural features to align with word tokens, which are fed into the frozen Parler-text-to-speech (TTS) model to generate highly intelligible speech. The voice cloning stage uses CosyVoice 2.0 (fine-tuned on TIMIT) to clone the speaker's voice, ensuring that the final re-synthesized speech waveform matches the original stimuli in clarity and voice characteristics.
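For orientation, a minimal structural sketch of the two trainable adaptors is given below in PyTorch. All dimensions here (electrode count, feature width, vocabulary size, sequence length) are illustrative assumptions rather than the authors' configuration, and the frozen HiFi-GAN, Parler-TTS, and CosyVoice 2.0 components are indicated only in comments, not implemented.

```python
# Hypothetical sketch of the dual-pathway adaptors; all shapes are assumptions.
import torch
import torch.nn as nn

class AcousticAdaptor(nn.Module):
    """Lightweight bidirectional LSTM: ECoG features -> speech representations."""
    def __init__(self, n_electrodes=256, d_speech=768, n_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_electrodes, d_speech // 2, num_layers=n_layers,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(d_speech, d_speech)

    def forward(self, ecog):            # ecog: (batch, time, electrodes)
        h, _ = self.lstm(ecog)          # (batch, time, d_speech)
        return self.proj(h)             # consumed by the frozen HiFi-GAN generator

class LinguisticAdaptor(nn.Module):
    """Transformer encoder: ECoG features -> word-token logits."""
    def __init__(self, n_electrodes=256, d_model=512, vocab=10000, n_layers=3):
        super().__init__()
        self.embed = nn.Linear(n_electrodes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ecog):
        return self.head(self.encoder(self.embed(ecog)))

ecog = torch.randn(1, 200, 256)          # hypothetical ECoG feature sequence
speech_repr = AcousticAdaptor()(ecog)    # -> frozen HiFi-GAN generator
word_logits = LinguisticAdaptor()(ecog)  # -> frozen Parler-TTS, then CosyVoice 2.0
```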

Figure 1—figure supplement 1
Speech-responsive and tone-discriminating electrodes for all participants.

Electrocorticography (ECoG) grids covering the lateral temporal lobe of all participants were warped onto the MNI152 template. Yellow electrodes are responsive to speech, while blue electrodes are not.

Figure 2 with 1 supplement
Neural-driven speech reconstruction performance.

(A) Ground truth and decoded speech comparisons. Waveforms (time vs. amplitude) and mel-spectrograms (time vs. frequency, 0–8 kHz) are shown for the original (top) and reconstructed (bottom) speech samples, demonstrating preserved spectral-temporal patterns in the neural-decoded output. Decoded phoneme and word sequences are shown alongside the speech. (B) Mel-spectrogram correlation analysis. The KDE curve shows the distribution of aligned correlation coefficients between the original and reconstructed mel-spectrograms across all test samples, while the bar chart shows the percentage of test-set trials in each bin (mean = 0.824±0.028). Higher values reflect better acoustic feature preservation in the time-frequency domain. Purple dashed lines (light to dark) show the average R² after adding –10 dB, –5 dB, and 0 dB additive noise to the original speech. (C) Human subject evaluations. The KDE curve displays the distribution of human evaluation results, while the bar chart shows the percentage of test-set trials in each bin (mean = 3.956±0.175). Higher values reflect better intelligibility. Purple dashed lines (light to dark) show the average mean opinion score (MOS) after adding –20 dB, –10 dB, and 0 dB additive noise to the original speech. (D–E) Word error rate (WER) and phoneme error rate (PER) assessment. The KDE curves display the distributions, while the bar charts show the percentage of test-set trials in each bin (mean WER = 0.189±0.033, mean PER = 0.120±0.025). Lower values indicate better word- and phoneme-level reconstruction accuracy. Purple dashed lines (light to dark) show the average WER after adding –10 dB, –5 dB, and 0 dB additive noise to the original speech.
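As a reading aid for panel B, the sketch below shows one plausible way to compute the mel-spectrogram correlation between an original and a reconstructed waveform; the spectrogram parameters (80 mel bins, 8 kHz upper limit matching the displayed range) and the crude length-trimming alignment are our assumptions, not the authors' exact procedure.

```python
# Hedged sketch of a mel-spectrogram correlation metric.
import numpy as np
import librosa

def mel_correlation(orig, recon, sr=16000, n_mels=80):
    """Pearson correlation between log-mel spectrograms of two waveforms."""
    n = min(len(orig), len(recon))      # crude temporal alignment by trimming
    mels = [librosa.power_to_db(
                librosa.feature.melspectrogram(y=w[:n], sr=sr,
                                               n_mels=n_mels, fmax=8000))
            for w in (orig, recon)]
    a, b = (m.flatten() for m in mels)
    return np.corrcoef(a, b)[0, 1]
```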

Figure 2—figure supplement 1
An example of original and re-synthesized speech in all stages.

Figure 3 with 3 supplements
Performance comparison between the re-synthesized speech and the MLP regression, acoustic, and linguistic baselines.

(A) Speech waveform and mel-spectrogram observations. Depicted are waveforms (time vs. amplitude) and mel-spectrograms (time vs. frequency, 0–8 kHz) for illustrative speech samples (ground truth; MLP regression and the intermediate outputs of the acoustic and linguistic pathways as baselines 1–3; and our ultimate re-synthesized natural speech). (B) Objective evaluation using mel-spectrogram R². Violin plots show the aggregated distribution of R² scores (0–1 scale) assessing spectral fidelity. The white dot represents the median, the box spans the interquartile range, and whiskers extend to ±1.5× the interquartile range. Three dashed lines (light to dark) show the average R² after adding –10 dB, –5 dB, and 0 dB additive noise to the original speech. (C) Subjective quality evaluation using mean opinion score (MOS). Violin plots show the aggregated MOS distribution (1–5 scale). Three dashed lines (light to dark) indicate the average MOS after adding –20 dB, –10 dB, and 0 dB additive noise. Note: the higher MOS for Baseline 3 (linguistic pathway) compared to ground truth reflects the superior acoustic quality of the modern speech corpus used to train Parler-text-to-speech (TTS), whereas the TIMIT corpus contains inherent noise. GT: ground truth. (D) Intelligibility assessment using word error rate (WER). Violin plots show the aggregated WER distribution (0–1 scale). Three purple dashed lines (light to dark) show the average WER after adding –10 dB, –5 dB, and 0 dB additive noise. (E) Phoneme error rate (PER) assessment. Format and noise conditions identical to panel D. Statistical significance markers: *p<0.01, **p<0.001; n.s.: not significant.
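The dashed noise-reference lines in panels B–E can be reproduced, under the assumption that the stated dB values are signal-to-noise ratios, by corrupting the original waveform with white Gaussian noise and re-scoring it:

```python
# Minimal sketch: additive white noise at a target SNR (our interpretation
# of the "-10 dB / -5 dB / 0 dB additive noise" reference conditions).
import numpy as np

def add_noise_at_snr(speech, snr_db, seed=0):
    rng = np.random.default_rng(seed)
    p_signal = np.mean(speech ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))   # SNR = 10*log10(Ps/Pn)
    noise = rng.normal(0.0, np.sqrt(p_noise), size=speech.shape)
    return speech + noise
```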

Figure 3—figure supplement 1
Performance comparison between re-synthesized speech and baselines for each participant.

(A) Speech waveform and mel-spectrogram observations. Depicted are waveforms (time vs. amplitude) and mel-spectrograms (time vs. frequency, 0–8 kHz) for illustrative speech samples (ground truth, baselines 1–3, and our ultimate re-synthesized natural speech). (B) Objective evaluation using mel-spectrogram R². Violin plots show the distribution of R² scores (0–1 scale) assessing spectral fidelity. In all violin plots, the white dot represents the median, the box spans the interquartile range (25th to 75th percentiles), whiskers extend to ±1.5× the interquartile range, and the violin width illustrates data density at each point on the y-axis. The three shades of dashed lines (light to dark) represent the average mel-spectrogram R² after adding additive noise at –10 dB, –5 dB, and 0 dB to the original speech waveform. The arrow on the right indicates the direction of better results. P1–P9: participants. (C) Subjective quality evaluation using mean opinion score (MOS). Violin plots show the distribution of MOS ratings (1–5 scale) assessing speech quality; dashed lines (light to dark) represent the average MOS after adding additive noise at –20 dB, –10 dB, and 0 dB. GT: ground truth (original speech). (D) Intelligibility assessment using word error rate (WER). Violin plots show the distribution of WER scores (0–1 scale) assessing speech recognition accuracy; purple dashed lines (light to dark) represent the average WER after adding additive noise at –10 dB, –5 dB, and 0 dB. The arrow on the left indicates the direction of better results (lower WER). (E) Similar to panel D, but evaluated using phoneme error rate (PER).
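For concreteness, the word error rate in panel D reduces to a normalized Levenshtein (edit) distance between recognized and reference word sequences; the PER in panel E is the same computation over phoneme sequences. A self-contained sketch, not the authors' exact scoring script:

```python
# WER as edit distance normalized by reference length.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                    # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                    # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```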

Figure 3—figure supplement 2
Model performance scales with the amount of training data.

(A) Mel-spectrogram correlation (R²) between reconstructed and original speech. (B) Word error rate (WER). (C) Phoneme error rate (PER). All metrics are shown as a function of the fraction of the training set used (25%, 50%, 75%, and 100%).
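The scaling analysis amounts to retraining the adaptors on random subsets of the training data; a minimal sketch, assuming a list-like training set:

```python
# Hypothetical subsampling loop for the data-scaling analysis.
import random

def subsample(train_set, fraction, seed=0):
    k = int(len(train_set) * fraction)
    return random.Random(seed).sample(train_set, k)

# for frac in (0.25, 0.50, 0.75, 1.00):
#     retrain the adaptors on subsample(train_set, frac)
#     and record R2, WER, and PER on the held-out test set
```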

Figure 3—figure supplement 3
Control analysis: Sensitivity of the acoustic pathway’s reconstruction to neural input.

(A) High-fidelity speech waveform (top) and corresponding mel-spectrogram (bottom) reconstructed from the full, veridical electrocorticography (ECoG) input by the acoustic pathway (Baseline 2). (B–D) Reconstructions when portions of the input ECoG signal are replaced with Gaussian noise of matched dimensionality. (B) Replacing the first half of the ECoG signal with noise. (C) Replacing the second half of the ECoG signal with noise. (D) Replacing the entire ECoG signal with noise. (E) Quantitative comparison of acoustic fidelity. A bar plot shows the mel-spectrogram R² values (mean ± s.e.m. across test samples) for the four conditions depicted in panels A–D: (1) full ECoG input (A), (2) first half replaced with noise (B), (3) second half replaced with noise (C), and (4) entire signal replaced with noise (D). The significant drop in R² for conditions involving noise replacement confirms that reconstruction quality is causally dependent on and temporally locked to the veridical neural signal.
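The perturbations in panels B–D can be sketched as below; matching the noise to each electrode's mean and standard deviation is our assumption about how the replacement noise was calibrated.

```python
# Hedged sketch of the noise-replacement control on an ECoG feature matrix.
import numpy as np

def replace_with_noise(ecog, start_frac, end_frac, seed=0):
    """ecog: (time, electrodes); replace [start_frac, end_frac) of the time axis."""
    rng = np.random.default_rng(seed)
    out = ecog.copy()
    t0, t1 = int(len(ecog) * start_frac), int(len(ecog) * end_frac)
    mu, sd = ecog.mean(axis=0), ecog.std(axis=0)
    out[t0:t1] = rng.normal(mu, sd, size=(t1 - t0, ecog.shape[1]))
    return out

# first_half  = replace_with_noise(ecog, 0.0, 0.5)   # panel B
# second_half = replace_with_noise(ecog, 0.5, 1.0)   # panel C
# all_noise   = replace_with_noise(ecog, 0.0, 1.0)   # panel D
```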

Figure 4
Phoneme recognition performance: confusion matrices and accuracy comparison.

(A) Confusion matrix between phoneme sequences transcribed using our proposed model (horizontal axis) and the ground truth (vertical axis). Diagonal values represent recognition accuracy for each phoneme (correct matches), while off-diagonal non-zero elements indicate substitution errors. Empty rows denote insertion errors (extraneous phonemes), and empty columns indicate deletion errors (missing phonemes). All non-zero elements are highlighted for visual clarity. (B–D) Phoneme confusion matrices using baselines 1–3 for phoneme transcription. (E) Phoneme class clarity (PCC) across methods. PCC measures the proportion of mis-decoded phonemes that are confused within the same class (vowel-vowel or consonant-consonant) rather than across classes (vowel-consonant). A higher PCC indicates that errors tend toward phonologically similar sounds, which supports intelligibility. The comparable PCC between our final integrated model and Baseline 3 (linguistic pathway) suggests that the phoneme-level error structure of our output is largely inherited from the high-quality linguistic prior embedded in the pre-trained text-to-speech (TTS) model (Parler-TTS). The violin plots show the distribution of PCC values across test samples. Asterisks (*) indicate phonemes for which our proposed framework significantly outperforms the baselines.
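Under one plausible reading of the definition in panel E, PCC is the fraction of substitution errors whose reference and decoded phonemes belong to the same broad class; the vowel set below is an illustrative ARPAbet subset, not the authors' exact inventory.

```python
# Hedged sketch of phoneme class clarity (PCC) over substitution errors.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}  # illustrative subset

def pcc(substitutions):
    """substitutions: list of (reference_phoneme, decoded_phoneme) pairs."""
    if not substitutions:
        return float("nan")
    same_class = sum((ref in VOWELS) == (dec in VOWELS)
                     for ref, dec in substitutions)
    return same_class / len(substitutions)
```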

Additional files

Supplementary file 1

Ablation study on adaptor architecture for the acoustic pathway.

Performance is evaluated by the mel-spectrogram R² (mean ± s.e.m. across participants), measuring the fidelity of reconstructed acoustic features. The bidirectional long short-term memory (LSTM) adaptor consistently outperformed the Transformer-based adaptor across different layer depths. The optimal performance was achieved with a 3-layer LSTM, which was selected for the final model.

https://cdn.elifesciences.org/articles/109400/elife-109400-supp1-v1.docx
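For reference, the mel-spectrogram R² used here and in Figure 3 can be read as a coefficient of determination over the spectrogram bins; the sketch below is one common formulation and may differ from the authors' exact script.

```python
# Hedged sketch of mel-spectrogram R^2 (coefficient of determination).
import numpy as np

def mel_r2(orig_mel, recon_mel):
    y, y_hat = orig_mel.flatten(), recon_mel.flatten()
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```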
Supplementary file 2

Ablation study on adaptor architecture for the linguistic pathway.

Performance is evaluated by word error rate (WER) and phoneme error rate (PER) (mean ± s.e.m. across participants), measuring the intelligibility of reconstructed speech. The Transformer-based adaptor achieved lower error rates than the long short-term memory (LSTM)-based adaptor across nearly all layer configurations. The optimal performance was achieved with a 3-layer Transformer, which was selected for the final model.

https://cdn.elifesciences.org/articles/109400/elife-109400-supp2-v1.docx
Supplementary file 3

Comparative overview of recent studies in neural-driven speech decoding and re-synthesis.

The table summarizes representative work, highlighting the neural recording modality, approximate amount of data used for decoder training per subject, the primary experimental task (perception or production), and reported performance metrics. Studies are ordered chronologically. Performance metrics include: Mean Opinion Score (MOS, scale 1–5), Extended Short-Time Objective Intelligibility (ESTOI, scale 0–1), Word Error Rate (WER, %), Phoneme Error Rate (PER, %), and mel-spectrogram correlation (R², scale 0–1). Note that direct numerical comparisons should be made with caution due to differences in neural signals, tasks, stimuli, and evaluation methodologies across studies. Our study (highlighted in bold) achieves a competitive balance between data efficiency (~20 min) and performance across multiple metrics (WER, PER, MOS, R²).

https://cdn.elifesciences.org/articles/109400/elife-109400-supp3-v1.docx
MDAR checklist
https://cdn.elifesciences.org/articles/109400/elife-109400-mdarchecklist1-v1.docx
Source data 1

The statistical source data for all figures and supplementary figures.

https://cdn.elifesciences.org/articles/109400/elife-109400-data1-v1.xlsx

Cite this article

  1. Jiawei Li
  2. Chunxu Guo
  3. Chao Zhang
  4. Edward F Chang
  5. Yuanning Li
(2026)
High-fidelity neural speech reconstruction through an efficient acoustic-linguistic dual-pathway framework
eLife 14:RP109400.
https://doi.org/10.7554/eLife.109400.3