High-Fidelity Neural Speech Reconstruction through an Efficient Acoustic-Linguistic Dual-Pathway Framework

  1. School of Biomedical Engineering, ShanghaiTech University, Shanghai, China
  2. State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China
  3. Department of Electronic Engineering, Tsinghua University, Beijing, China
  4. Shanghai Artificial Intelligence Laboratory, Shanghai, China
  5. Department of Neurological Surgery, University of California, San Francisco, San Francisco, United States
  6. Shanghai Clinical Research and Trial Center, Shanghai, China
  7. Lingang Laboratory, Shanghai, China

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Nai Ding
    Zhejiang University, Hangzhou, China
  • Senior Editor
    Yanchao Bi
    Peking University, Beijing, China

Reviewer #1 (Public review):

Summary:

This paper introduces a dual-pathway model for reconstructing naturalistic speech from intracranial ECoG data. It integrates an acoustic pathway (LSTM + HiFi-GAN for spectral detail) and a linguistic pathway (Transformer + Parler-TTS for linguistic content). Outputs from the two pathways are then merged via CosyVoice 2.0 voice cloning. Using only 20 minutes of ECoG data per participant, the model achieves high acoustic fidelity and linguistic intelligibility.
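
As a rough illustration of this division of labor (not the authors' code), the sketch below pairs an LSTM adaptor that maps ECoG to mel-spectrogram frames for a HiFi-GAN-style vocoder with a Transformer adaptor that maps ECoG to text-token logits for a TTS model; the electrode count, dimensions, and vocabulary size are hypothetical.

```python
import torch
import torch.nn as nn

class AcousticAdaptor(nn.Module):
    """Acoustic pathway: ECoG -> mel-spectrogram frames (vocoded downstream)."""
    def __init__(self, n_electrodes=128, n_mels=80, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_electrodes, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, ecog):                      # ecog: (batch, time, electrodes)
        h, _ = self.lstm(ecog)
        return self.proj(h)                       # (batch, time, n_mels)

class LinguisticAdaptor(nn.Module):
    """Linguistic pathway: ECoG -> text-token logits (rendered by a TTS model)."""
    def __init__(self, n_electrodes=128, vocab_size=5000, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(n_electrodes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ecog):                      # ecog: (batch, time, electrodes)
        return self.head(self.encoder(self.embed(ecog)))
```

In the framework described above, the two pathway outputs would then be synthesized separately and merged by a voice-cloning model such as CosyVoice 2.0.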

Strengths:

(1) The proposed dual-pathway framework effectively integrates the strengths of neural-to-acoustic and neural-to-text decoding and aligns well with established neurobiological models of dual-stream processing in speech and language.

(2) The integrated approach achieves robust speech reconstruction using only 20 minutes of ECoG data per subject, demonstrating the efficiency of the proposed method.

(3) The use of multiple evaluation metrics (MOS, mel-spectrogram R², WER, PER) spanning acoustic, linguistic (phoneme and word), and perceptual dimensions, together with comparisons against noise-degraded baselines, adds strong quantitative rigor to the study.
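
For concreteness, the sketch below shows how two of these metric types might be computed; it assumes librosa and jiwer are available and uses placeholder inputs rather than the study's actual evaluation pipeline.

```python
import numpy as np
import librosa
from jiwer import wer

def mel_spectrogram_r2(reference, reconstructed, sr=16000, n_mels=80):
    """R^2 between log-mel spectrograms of a reference and a reconstructed waveform."""
    ref = librosa.feature.melspectrogram(y=reference, sr=sr, n_mels=n_mels)
    rec = librosa.feature.melspectrogram(y=reconstructed, sr=sr, n_mels=n_mels)
    n = min(ref.shape[1], rec.shape[1])            # align frame counts
    ref, rec = np.log(ref[:, :n] + 1e-8), np.log(rec[:, :n] + 1e-8)
    ss_res = np.sum((ref - rec) ** 2)
    ss_tot = np.sum((ref - ref.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# WER/PER compare ASR transcripts (or phoneme strings) of the reconstruction
# against the ground-truth transcript.
error_rate = wer("the reference transcript", "the recognised transcript")
```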

Weaknesses:

(1) It is unclear how much the acoustic pathway contributes to the final reconstruction results, based on Figures 3B-E and 4E. Including results from Baseline 2 + CosyVoice and Baseline 3 + CosyVoice could help clarify this contribution.

(2) As noted in the limitations, the reconstruction results heavily rely on pre-trained generative models. However, no comparison is provided with state-of-the-art multimodal LLMs such as Qwen3-Omni, which can process auditory and textual information simultaneously. The rationale for using separate models (Wav2Vec for speech and TTS for text) instead of a single unified generative framework should be clearly justified. In addition, the adaptor employs an LSTM architecture for speech but a Transformer for text, which may introduce confounds in the performance comparison. Is there any theoretical or empirical motivation for adopting recurrent networks for auditory processing and Transformer-based models for textual processing?

(3) The model is trained on approximately 20 minutes of data per participant, which raises concerns about potential overfitting. It would be helpful if the authors could analyze whether test sentences with higher or lower reconstruction performance contain words that were also present in the training set (see the sketch after this list).

(4) The phoneme confusion matrix in Figure 4A does not appear to align with human phoneme confusion patterns. For instance, /s/ and /z/ differ only in voicing, yet the model does not seem to confuse these phonemes. Does this imply that the model and the human brain operate differently at the mechanistic level?

(5) In general, is the motivation for adopting the dual-pathway model to better align with the organization of the human brain, or to achieve improved engineering performance? If the goal is primarily engineering-oriented, the authors should compare their approach with a pretrained multimodal LLM rather than relying on the dual-pathway architecture. Conversely, if the design aims to mirror human brain function, additional analysis, such as detailed comparisons of phoneme confusion matrices, should be included to demonstrate that the model exhibits brain-like performance patterns.
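
A minimal sketch of the train/test word-overlap analysis suggested in point (3) above; the sentence lists and per-sentence scores are hypothetical placeholders, not data from the study.

```python
import numpy as np

def train_word_overlap(test_sentence, train_sentences):
    """Fraction of words in a test sentence that also appear in the training set."""
    train_vocab = {w.lower() for s in train_sentences for w in s.split()}
    words = [w.lower() for w in test_sentence.split()]
    return sum(w in train_vocab for w in words) / max(len(words), 1)

# Correlating overlap with per-sentence WER would indicate whether the best
# reconstructions are driven by words memorised from the training set, e.g.:
# overlaps = [train_word_overlap(s, train_sentences) for s in test_sentences]
# r = np.corrcoef(overlaps, per_sentence_wer)[0, 1]
```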

Reviewer #2 (Public review):

Summary:

The study by Li et al. proposes a dual-path framework that concurrently decodes acoustic and linguistic representations from ECoG recordings. By integrating advanced pre-trained AI models, the approach preserves both acoustic richness and linguistic intelligibility, and achieves a WER of 18.9% with a short (~20-minute) recording.

Overall, the study offers an advanced and promising framework for speech decoding. The method appears sound, and the results are clear and convincing. My main concerns are the need for additional control analyses and for more comparisons with existing models.

Strengths:

(1) This speech-decoding framework employs several advanced pre-trained DNN models, reaching superior performance (WER of 18.9%) with a relatively short (~20-minute) neural recording.

(2) The dual-pathway design is elegant, and the study clearly demonstrates its necessity: The acoustic pathway enhances spectral fidelity while the linguistic pathway improves linguistic intelligibility.

Weaknesses:

The DNNs used were pre-trained on large corpora, including TIMIT, which is also the source of the experimental stimuli. More generally, as DNNs are powerful at generating speech, additional evidence is needed to show that decoding performance is driven by neural signals rather than by the DNNs' generative capacity.

Author response:

Here we provide a provisional response addressing the public comments and outlining the revisions we are planning to make:

(1) We will add additional baseline models to delineate the contributions of the acoustic and linguistic pathways.

(2) We will show additional ablation analysis and other model comparison results, as suggested by the reviewers, to justify the choice of the DNN models.

(3) We will clarify the use of the TIMIT dataset during pre-training. The TIMIT speech data (the corpus from which the test sentences were drawn) was not used when pre-training the acoustic or linguistic pathway; it was only used to fine-tune the final speech synthesizer (the CosyVoice model). We will present results without this fine-tuning step, which fully eliminates the use of TIMIT data during model training.

(4) We will further analyze the phoneme confusion matrices and/or other data to evaluate the model behavior (see the sketch after this list).

(5) We will analyze the test sentences with high and low accuracies. We will also include results with partial training data (e.g., using 25%, 50%, or 75% of the training set) to further evaluate the impact of the total amount of training data.
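
A minimal sketch of the confusion-matrix comparison planned in item (4); the aligned true/decoded phoneme sequences and the human-listener confusion matrix are hypothetical inputs, not part of the study's pipeline.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def normalised_confusions(true_phonemes, decoded_phonemes, labels):
    """Row-normalised phoneme confusion matrix from aligned phoneme sequences."""
    cm = confusion_matrix(true_phonemes, decoded_phonemes, labels=labels).astype(float)
    return cm / np.clip(cm.sum(axis=1, keepdims=True), 1e-8, None)

def confusion_similarity(model_cm, human_cm):
    """Correlation of off-diagonal (confusion) entries between model and human listeners."""
    mask = ~np.eye(model_cm.shape[0], dtype=bool)
    return np.corrcoef(model_cm[mask], human_cm[mask])[0, 1]
```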
