Siamese Style Convolutional Neural Networks for Sound Search by Vocal Imitation

Yichi Zhang*, Bryan A Pardo, Zhiyao Duan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

38 Scopus citations

Abstract

Conventional methods for finding audio in databases typically search text labels rather than the audio itself. This can be problematic, as labels may be missing, irrelevant to the audio content, or unknown to users. Query by vocal imitation lets users query using vocal imitations instead. To do so, appropriate audio feature representations and effective similarity measures between imitations and original sounds must be developed. In this paper, we build upon our preliminary work to propose Siamese style convolutional neural networks that learn feature representations and similarity measures in a unified end-to-end training framework. Our Siamese architecture uses two convolutional neural networks to extract features, one from vocal imitations and the other from original sounds. The encoded features are then concatenated and fed into a fully connected network to estimate their similarity. We propose two versions of the system: IMINET is symmetric, where the two encoders have an identical structure and are trained from scratch, while TL-IMINET is asymmetric and adopts the transfer learning idea by pretraining the two encoders on other relevant tasks: spoken language recognition for the imitation encoder and environmental sound classification for the original sound encoder. Experimental results show that both versions of the proposed system outperform a state-of-the-art system for sound search by vocal imitation, and that performance can be further improved when they are fused with the state-of-the-art system. Results also show that transfer learning significantly improves retrieval performance. This paper also provides insights into the proposed networks by visualizing and sonifying input patterns that maximize the activation of certain neurons in different layers.
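The architecture described above — two encoder towers whose outputs are concatenated and passed to a fully connected similarity head — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: dense layers stand in for the CNN encoders, all class and variable names (`Encoder`, `SiameseStyleNet`, `similarity`) are hypothetical, and the asymmetric two-tower layout loosely mirrors TL-IMINET.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Encoder:
    """Stand-in for a CNN tower: maps a flattened spectrogram to a feature vector.

    In the actual paper each tower is a convolutional network; a single dense
    layer is used here only to keep the sketch short and self-contained.
    """
    def __init__(self, in_dim, feat_dim):
        self.W = rng.standard_normal((in_dim, feat_dim)) * 0.01
        self.b = np.zeros(feat_dim)

    def __call__(self, x):
        return relu(x @ self.W + self.b)

class SiameseStyleNet:
    """Two (possibly different) encoders plus an FC similarity head.

    With identical encoder structures this corresponds to the symmetric
    IMINET variant; with differently pretrained towers, to TL-IMINET.
    """
    def __init__(self, imit_dim, orig_dim, feat_dim=64):
        self.enc_imitation = Encoder(imit_dim, feat_dim)  # tower for vocal imitations
        self.enc_original = Encoder(orig_dim, feat_dim)   # tower for original sounds
        self.W_fc = rng.standard_normal((2 * feat_dim, 1)) * 0.01
        self.b_fc = np.zeros(1)

    def similarity(self, imitation, original):
        # Concatenate the two feature vectors and score with the FC head.
        f = np.concatenate([self.enc_imitation(imitation),
                            self.enc_original(original)])
        return float(sigmoid(f @ self.W_fc + self.b_fc)[0])

# Retrieval sketch: score one imitation against a small sound database,
# then rank the sounds by descending estimated similarity.
net = SiameseStyleNet(imit_dim=128, orig_dim=128)
imitation = rng.standard_normal(128)
sounds = [rng.standard_normal(128) for _ in range(5)]
scores = [net.similarity(imitation, s) for s in sounds]
ranking = np.argsort(scores)[::-1]
```

Because the similarity head ends in a sigmoid, each score lies in (0, 1) and the ranking step simply sorts candidate sounds by that score, which is how retrieval-by-imitation reduces to a learned pairwise similarity.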

Original language: English (US)
Article number: 8453811
Pages (from-to): 429-441
Number of pages: 13
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 27
Issue number: 2
DOIs
State: Published - Feb 2019

Funding

Manuscript received May 24, 2018; revised August 16, 2018; accepted August 17, 2018. Date of publication September 3, 2018; date of current version December 6, 2018. This work was supported by the National Science Foundation under Grants 1617107 and 1617497. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Matthew E. P. Davies. (Corresponding author: Yichi Zhang.) Y. Zhang and Z. Duan are with the Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY 14627 USA (e-mail: [email protected]; [email protected]). The authors would like to acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

Keywords

  • Siamese style convolutional neural networks
  • Vocal imitation
  • information retrieval
  • metric learning
  • transfer learning

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
