Siamese Style Convolutional Neural Networks for Sound Search by Vocal Imitation

Yichi Zhang*, Bryan A Pardo, Zhiyao Duan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

13 Scopus citations


Conventional methods for finding audio in databases typically search text labels rather than the audio itself. This can be problematic because labels may be missing, irrelevant to the audio content, or unknown to users. Query by vocal imitation lets users search by imitating the desired sound with their voice instead. To make this possible, appropriate audio feature representations and effective similarity measures between imitations and original sounds must be developed. In this paper, we build upon our preliminary work to propose Siamese style convolutional neural networks that learn feature representations and similarity measures in a unified end-to-end training framework. Our Siamese architecture uses two convolutional neural networks to extract features, one from vocal imitations and the other from original sounds. The encoded features are then concatenated and fed into a fully connected network to estimate their similarity. We propose two versions of the system. IMINET is symmetric: the two encoders have an identical structure and are trained from scratch. TL-IMINET is asymmetric and adopts transfer learning by pretraining the two encoders on other relevant tasks: spoken language recognition for the imitation encoder and environmental sound classification for the original sound encoder. Experimental results show that both versions of the proposed system outperform a state-of-the-art system for sound search by vocal imitation, and that performance improves further when they are fused with the state-of-the-art system. Results also show that transfer learning significantly improves retrieval performance. This paper also provides insights into the proposed networks by visualizing and sonifying input patterns that maximize the activation of certain neurons in different layers.
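The two-encoder design described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The filter count, kernel size, global max pooling, hidden-layer width, and the names SiameseSketch, encode, and rank_by_similarity are all assumptions made for illustration, and the weights are random rather than trained, so the scores carry no meaning beyond showing the data flow: two separate convolutional encoders, concatenation of their features, and a fully connected network that outputs a similarity in [0, 1].

```python
import math
import random


def conv2d_relu(x, kernel):
    """Valid 2-D cross-correlation of a 2-D list with one kernel, then ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(x) - kh + 1):
        row = []
        for j in range(len(x[0]) - kw + 1):
            s = sum(kernel[a][b] * x[i + a][j + b]
                    for a in range(kh) for b in range(kw))
            row.append(max(0.0, s))
        out.append(row)
    return out


def encode(spec, kernels):
    """Tiny encoder (assumed design): one conv layer, global max pool per filter."""
    return [max(max(row) for row in conv2d_relu(spec, k)) for k in kernels]


def random_kernels(rng, n, size=3):
    """Untrained placeholder filters; a real system would learn these."""
    return [[[rng.uniform(-1, 1) for _ in range(size)] for _ in range(size)]
            for _ in range(n)]


class SiameseSketch:
    """Two encoders with separate weights feeding one fully connected scorer."""

    def __init__(self, n_filters=4, hidden=8, seed=0):
        rng = random.Random(seed)
        self.imitation_kernels = random_kernels(rng, n_filters)
        self.original_kernels = random_kernels(rng, n_filters)
        self.w1 = [[rng.uniform(-1, 1) for _ in range(2 * n_filters)]
                   for _ in range(hidden)]
        self.w2 = [rng.uniform(-1, 1) for _ in range(hidden)]

    def similarity(self, imitation, original):
        # Concatenate the two encoded feature vectors.
        features = (encode(imitation, self.imitation_kernels)
                    + encode(original, self.original_kernels))
        # Fully connected hidden layer with ReLU, then a sigmoid output unit.
        hidden = [max(0.0, sum(w * f for w, f in zip(row, features)))
                  for row in self.w1]
        logit = sum(w * h for w, h in zip(self.w2, hidden))
        return 1.0 / (1.0 + math.exp(-logit))


def rank_by_similarity(model, imitation, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    scores = {name: model.similarity(imitation, spec)
              for name, spec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    rng = random.Random(42)

    def fake_spectrogram(rows=6, cols=8):
        # Stand-in for a time-frequency representation with values in [0, 1).
        return [[rng.random() for _ in range(cols)] for _ in range(rows)]

    model = SiameseSketch()
    query = fake_spectrogram()
    database = {name: fake_spectrogram() for name in ("dog", "door", "siren")}
    print(rank_by_similarity(model, query, database))
```

In this sketch the symmetric case corresponds to both encoders sharing the same structure, as they do here. An asymmetric variant in the spirit of TL-IMINET would give each encoder its own structure and initialize it from weights pretrained on a different task, which this toy example does not attempt.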

Original language: English (US)
Article number: 8453811
Pages (from-to): 429-441
Number of pages: 13
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Issue number: 2
State: Published - Feb 2019


Keywords

  • Siamese style convolutional neural networks
  • Vocal imitation
  • information retrieval
  • metric learning
  • transfer learning

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering

