Crowdsourced LibriTTS Speech Prominence Annotations

  • Max Morrison (Creator)
  • Max Morrison (Contributor)
  • Pranav Pawar (Contributor)
  • Nathan Pruyne (Contributor)
  • Jennifer S Cole (Contributor)
  • Bryan Pardo (Contributor)



Dataset corresponding to the ICASSP 2024 paper "Crowdsourced and Automatic Speech Prominence Estimation". This dataset is useful for training machine learning models to perform automatic emphasis annotation, as well as downstream tasks such as emphasis-controlled TTS, emotion recognition, and text summarization. The dataset is described in Section 3 (Emphasis Annotation Dataset). The contents of this section are copied below for convenience.

We used our crowdsourced annotation system to perform human annotation on one eighth of the train-clean-100 partition of the LibriTTS [1] dataset. Specifically, participants annotated 3,626 utterances with a total length of 6.42 hours and 69,809 words from 18 speakers (9 male and 9 female). We collected at least one annotation of all 3,626 utterances, at least two annotations of 2,259 of those utterances, at least four annotations of 974 utterances, and at least eight annotations of 453 utterances. We did this in order to explore (in Section 6) whether it is more cost-effective to train a system on multiple annotations of fewer utterances or fewer annotations of more utterances.

We paid 298 annotators to annotate batches of 20 utterances, where each batch takes approximately 15 minutes. We paid $3.34 for each completed batch (estimated $13.35 per hour). Annotators each annotated between one and six batches. We recruited US residents on MTurk with an approval rating of at least 99 and at least 1000 approved tasks.

Today, microlabor platforms like MTurk are plagued by automated task-completion software agents (bots) that randomly fill out surveys. We filtered out bots by excluding annotations from an additional 107 annotators that marked more than 2/3 of words as emphasized in eight or more of the 20 utterances in a batch. Annotators who fail the bot filter are blocked from performing further annotation.
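The bot filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to build the dataset: the function name `fails_bot_filter` and the representation of a batch as a list of per-utterance 0/1 score lists are hypothetical.

```python
def fails_bot_filter(batch, min_flagged_utterances=8):
    """Return True if a batch looks bot-generated under the rule above.

    `batch` is assumed (hypothetically) to be a list of utterances, each a
    list of 0/1 scores with 1 meaning the word was marked as emphasized.
    An utterance is flagged when MORE than 2/3 of its words are emphasized;
    a batch fails when at least `min_flagged_utterances` are flagged.
    """
    flagged = 0
    for scores in batch:
        # Integer form of sum/len > 2/3, which avoids float rounding
        # (an empty utterance is never flagged).
        if 3 * sum(scores) > 2 * len(scores):
            flagged += 1
    return flagged >= min_flagged_utterances


# Toy batch of 20 utterances: 8 fully emphasized, 12 with no emphasis.
suspicious = [[1, 1, 1]] * 8 + [[0, 0, 0]] * 12
# Same batch with only 7 heavily emphasized utterances.
plausible = [[1, 1, 1]] * 7 + [[0, 0, 0]] * 13

print(fails_bot_filter(suspicious))
print(fails_bot_filter(plausible))
```

Note that an utterance with exactly 2/3 of its words emphasized is not flagged, since the rule requires more than 2/3.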
We also recorded participants' native country and language, but note these may be unreliable as many MTurk workers use VPNs to subvert IP region filters on MTurk [2]. The average Cohen's kappa score for annotators with at least one overlapping utterance is 0.226 (i.e., "Fair" agreement). However, not all annotators annotate the same utterances, and this average overemphasizes pairs of annotators with low overlap. Therefore, we use a one-parameter logistic model (i.e., a Rasch model) computed via py-irt [3], which predicts heldout annotations from scores of overlapping annotators with 77.7% accuracy (50% is random).

The structure of this dataset is a single JSON file of word-aligned emphasis annotations. The JSON references file stems of the LibriTTS dataset, which can be found here. All code used in the creation of the dataset can be found here. The format of the JSON file is as follows.

    {
        <annotator>: {
            "annotations": [
                {
                    "score": [<score>, <score>, ...],
                    "stem": <stem>,
                    "words": [<word>, <word>, ...]
                },
                ...
            ],
            "country": <country>
        },
        ...
    }

[1] Zen et al., “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” in Interspeech, 2019.
[2] Moss et al., “Bots or inattentive humans? Identifying sources of low-quality data in online platforms,” PsyArXiv preprint PsyArXiv:wr8ds, 2021.
[3] John Patrick Lalor and Pedro Rodriguez, “py-irt: A scalable item response theory library for Python,” INFORMS Journal on Computing, 2023.
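As a hedged illustration of how the JSON layout described above might be consumed, the sketch below averages the per-word scores of every annotator who labeled the same LibriTTS stem. It assumes the top-level keys identify annotators and that each score is a 0/1 (or boolean) value aligned with the word at the same index. The function name `mean_scores_by_stem`, the toy annotator keys, and the stems are all hypothetical, and the toy dictionary stands in for the result of calling `json.load` on the real file.

```python
def mean_scores_by_stem(data):
    """Average word-level scores across annotators for each stem.

    `data` is assumed to map an annotator key to a dict holding an
    "annotations" list, where each entry has "score", "stem" and "words".
    """
    totals, counts, words = {}, {}, {}
    for annotator in data.values():
        for annotation in annotator["annotations"]:
            stem = annotation["stem"]
            scores = [float(s) for s in annotation["score"]]
            if stem not in totals:
                totals[stem] = [0.0] * len(scores)
                counts[stem] = 0
                words[stem] = annotation["words"]
            if len(scores) != len(totals[stem]):
                continue  # skip entries whose length disagrees with the first
            totals[stem] = [t + s for t, s in zip(totals[stem], scores)]
            counts[stem] += 1
    return {
        stem: {
            "words": words[stem],
            "mean": [t / counts[stem] for t in totals[stem]],
            "count": counts[stem],
        }
        for stem in totals
    }


# Toy data in the assumed layout (keys and stems are made up).
toy = {
    "annotator_a": {
        "annotations": [{"score": [1, 1], "stem": "stem_a", "words": ["hello", "world"]}],
        "country": "US",
    },
    "annotator_b": {
        "annotations": [{"score": [0, 1], "stem": "stem_a", "words": ["hello", "world"]}],
        "country": "US",
    },
}

result = mean_scores_by_stem(toy)
print(result["stem_a"]["mean"])  # [0.5, 1.0]
```

Averaging is only one way to combine multiple annotations of an utterance; it is used here because it keeps the example short, not because the paper prescribes it.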
Date made available: Dec 18 2023
