A Novel Phoneme-based Modeling for Text-independent Speaker Identification

Xin Wang, Chuan Xie, Qiang Wu, Huayi Zhan, Ying Wu

Research output: Contribution to journal › Conference article › peer-review

Abstract

Text-independent speaker identification has attracted growing attention, yet it remains challenging to extract speaker-specific features from speech with arbitrary content. End-to-end systems trained with utterance-level features suffer from performance degradation caused by speech content variation. To address this issue, this paper proposes a novel phoneme-based approach with the following key features: first, it restricts the variety of speech content by splitting each utterance into a set of phoneme segments and develops phoneme-constrained models to extract segment-level speaker embeddings; second, it leverages a soft-voting mechanism with mono-phonemic thresholds and weights to combine the results of different phonemes. Experimental results on the AISHELL and ASRU2019 datasets show that the proposed approach is effective and robust, outperforming state-of-the-art methods in both EER and accuracy, especially under a larger phonemic mismatch between the enrollment and test utterances. In addition, the proposed system is efficient and can be trained well on a small-scale dataset.
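The soft-voting step described above can be illustrated with a minimal sketch. The function and variable names below are assumptions for illustration only (the paper's actual scoring and combination details are not given in the abstract): each phoneme segment contributes a weighted vote when its similarity score exceeds that phoneme's threshold.

```python
def soft_vote(scores, weights, thresholds):
    """Hypothetical sketch of phoneme-wise soft voting.

    scores[p]     -- similarity between test and enrollment embeddings
                     for phoneme p (names assumed, not from the paper)
    weights[p]    -- per-phoneme reliability weight
    thresholds[p] -- mono-phonemic acceptance threshold
    Returns the weighted fraction of phonemes voting for the speaker.
    """
    votes = 0.0
    total = 0.0
    for p, s in scores.items():
        total += weights[p]
        if s >= thresholds[p]:
            # this phoneme segment votes for the candidate speaker
            votes += weights[p]
    return votes / total if total > 0 else 0.0

# Toy example with three phoneme segments:
scores = {"a": 0.82, "i": 0.40, "sh": 0.75}
weights = {"a": 1.0, "i": 0.5, "sh": 0.8}
thresholds = {"a": 0.6, "i": 0.6, "sh": 0.6}
print(round(soft_vote(scores, weights, thresholds), 3))  # prints 0.783
```

Per-phoneme thresholds and weights let reliable phonemes dominate the decision, which is one plausible way a system could stay robust when enrollment and test utterances share few phonemes.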

Original language: English (US)
Pages (from-to): 4775-4779
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2022-September
State: Published - 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: Sep 18 2022 - Sep 22 2022

Keywords

  • phoneme-based models
  • segment-level feature extraction
  • text-independent speaker identification

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
