Fine-Grained and Interpretable Neural Speech Editing

Max Morrison, Cameron Churchwell, Nathan Pruyne, Bryan Pardo

Research output: Contribution to journal > Conference article > peer-review

1 Scopus citation

Abstract

Fine-grained editing of speech attributes, such as prosody (i.e., pitch, loudness, and phoneme durations), pronunciation, speaker identity, and formants, is useful for fine-tuning and fixing imperfections in human and AI-generated speech recordings for the creation of podcasts, film dialogue, and video game dialogue. Existing speech synthesis systems use representations that entangle two or more of these attributes, prohibiting their use in fine-grained, disentangled editing. In this paper, we demonstrate the first disentangled and interpretable representation of speech with subjective and objective vocoding reconstruction accuracy comparable to that of Mel spectrograms. Our interpretable representation, combined with our proposed data augmentation method, enables training an existing neural vocoder to perform fast, accurate, and high-quality editing of pitch, duration, volume, timbral correlates of volume, pronunciation, speaker identity, and spectral balance.

Original language: English (US)
Pages (from-to): 187-191
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: Sep 1 2024 - Sep 5 2024

Keywords

  • control
  • editing
  • interpretable
  • representation

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
