Speech synthesis from ECoG using densely connected 3D convolutional neural networks

Miguel Angrick, Christian Herff, Emily Mugler, Matthew Christopher Tate, Marc W. Slutzky, Dean J. Krusienski, Tanja Schultz

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Objective. Direct synthesis of speech from neural signals could provide a fast and natural way of communication for people with neurological diseases. Invasively measured brain activity (electrocorticography; ECoG) supplies the necessary temporal and spatial resolution to decode fast and complex processes such as speech production. A number of impressive advances in speech decoding using neural signals have been achieved in recent years, but the complex dynamics are still not fully understood. However, it is unlikely that simple linear models can capture the relation between neural activity and continuous spoken speech. Approach. Here we show that deep neural networks can be used to map ECoG from speech production areas onto an intermediate representation of speech (logMel spectrogram). The proposed method uses a densely connected convolutional neural network topology which is well suited to the small amount of data available from each participant. Main results. In a study with six participants, we achieved correlations up to r = 0.69 between the reconstructed and original logMel spectrograms. We transferred our prediction back into an audible waveform by applying a Wavenet vocoder. The vocoder was conditioned on logMel features that harnessed a much larger, pre-existing data corpus to provide the most natural acoustic output. Significance. To the best of our knowledge, this is the first time that high-quality speech has been reconstructed from neural recordings during speech production using deep neural networks.
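The abstract describes a two-stage pipeline: a densely connected 3D convolutional network regresses logMel spectrogram frames from spatio-temporal ECoG windows, and a Wavenet vocoder converts the predicted frames back into an audible waveform. The sketch below illustrates only the first stage in PyTorch; it is not the authors' released code, and the layer count, growth rate, electrode-grid size (8 x 8), window length, and number of mel bins (40) are illustrative assumptions rather than the paper's actual hyperparameters. The vocoder stage is omitted.

import torch
import torch.nn as nn

class DenseBlock3d(nn.Module):
    # Dense connectivity: each layer receives the concatenated feature maps
    # of all preceding layers (the DenseNet idea, applied here in 3D).
    def __init__(self, in_ch, growth, n_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm3d(ch),
                nn.ReLU(inplace=True),
                nn.Conv3d(ch, growth, kernel_size=3, padding=1),
            ))
            ch += growth
        self.out_channels = ch

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # accumulate feature maps
        return x

class EcogToLogMel(nn.Module):
    # Regresses one logMel frame from one ECoG window of shape
    # (batch, 1, time, grid height, grid width).
    def __init__(self, n_mels=40):
        super().__init__()
        self.features = DenseBlock3d(in_ch=1, growth=8, n_layers=4)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),  # global pooling over time and grid
            nn.Flatten(),
            nn.Linear(self.features.out_channels, n_mels),
        )

    def forward(self, x):
        return self.head(self.features(x))

def pearson_r(a, b):
    # Pearson correlation, the metric the study reports (up to r = 0.69
    # between reconstructed and original logMel spectrograms).
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

# Toy usage: 9 ECoG time steps over an 8x8 grid -> 40 logMel coefficients.
model = EcogToLogMel(n_mels=40)
window = torch.randn(2, 1, 9, 8, 8)
predicted = model(window)
print(predicted.shape)  # torch.Size([2, 40])

The appeal of the densely connected topology noted in the abstract is parameter efficiency: feature reuse across layers keeps the model small enough to train on the limited per-participant data typical of ECoG studies.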

Original language: English (US)
Article number: 036019
Journal: Journal of Neural Engineering
ISSN: 1741-2560
Publisher: IOP Publishing Ltd.
Volume: 16
Issue number: 3
DOI: 10.1088/1741-2552/ab0c59
State: Published - Jan 1 2019

Fingerprint

  • Speech synthesis
  • Neural networks
  • Acoustics
  • Decoding
  • Linear Models
  • Brain
  • Communication
  • Topology

Keywords

  • BCI
  • Brain-computer interfaces
  • Electrocorticography
  • Neural networks
  • Speech synthesis
  • Wavenet

ASJC Scopus subject areas

  • Biomedical Engineering
  • Cellular and Molecular Neuroscience

Cite this

Angrick, M., Herff, C., Mugler, E., Tate, M. C., Slutzky, M. W., Krusienski, D. J., & Schultz, T. (2019). Speech synthesis from ECoG using densely connected 3D convolutional neural networks. Journal of Neural Engineering, 16(3), 036019. https://doi.org/10.1088/1741-2552/ab0c59