TY - JOUR
T1 - Context-Aware Prosody Correction for Text-Based Speech Editing
AU - Morrison, Max
AU - Rencker, Lucas
AU - Jin, Zeyu
AU - Bryan, Nicholas J.
AU - Caceres, Juan Pablo
AU - Pardo, Bryan
N1 - Funding Information:
This work was carried out during an internship at Adobe Research.
Publisher Copyright:
© 2021 Institute of Electrical and Electronics Engineers Inc. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Text-based speech editors expedite the process of editing speech recordings by permitting editing via intuitive cut, copy, and paste operations on a speech transcript. A major drawback of current systems, however, is that edited recordings often sound unnatural because of prosody mismatches around edited regions. In our work, we propose a new context-aware method for more natural-sounding text-based editing of speech. To do so, we 1) use a series of neural networks to generate salient prosody features that are dependent on the prosody of speech surrounding the edit and amenable to fine-grained user control, 2) use the generated features to control a standard pitch-shift and time-stretch method, and 3) apply a denoising neural network to remove artifacts induced by the signal manipulation to yield a high-fidelity result. We evaluate our approach using a subjective listening test, provide a detailed comparative analysis, and conclude with several interesting insights.
AB - Text-based speech editors expedite the process of editing speech recordings by permitting editing via intuitive cut, copy, and paste operations on a speech transcript. A major drawback of current systems, however, is that edited recordings often sound unnatural because of prosody mismatches around edited regions. In our work, we propose a new context-aware method for more natural-sounding text-based editing of speech. To do so, we 1) use a series of neural networks to generate salient prosody features that are dependent on the prosody of speech surrounding the edit and amenable to fine-grained user control, 2) use the generated features to control a standard pitch-shift and time-stretch method, and 3) apply a denoising neural network to remove artifacts induced by the signal manipulation to yield a high-fidelity result. We evaluate our approach using a subjective listening test, provide a detailed comparative analysis, and conclude with several interesting insights.
KW - Deep learning
KW - Pitch-shifting
KW - Prosody generation
KW - Speech
KW - Time-stretching
UR - http://www.scopus.com/inward/record.url?scp=85115182290&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115182290&partnerID=8YFLogxK
U2 - 10.1109/ICASSP39728.2021.9414633
DO - 10.1109/ICASSP39728.2021.9414633
M3 - Conference article
AN - SCOPUS:85115182290
VL - 2021-June
SP - 7038
EP - 7042
JO - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
JF - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
SN - 0736-7791
T2 - 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -