Separating an audio scene into isolated sources is a fundamental problem in computer audition, analogous to image segmentation in visual scene analysis. Source separation systems based on deep learning are currently the most successful approaches to the underdetermined separation problem, where there are more sources than channels. Such systems are typically trained on sound mixtures whose ground truth decomposition is already known. In this work, we apply an unsupervised spatial source separation algorithm to stereo mixtures to generate initial decompositions, which we then use to train a deep learning source separation model. These estimated decompositions vary greatly in quality across the training mixtures. To overcome this, we weight the data during training using a confidence measure that assesses which mixtures, or parts of mixtures, are well separated by the unsupervised algorithm. Once trained, the model can be applied to separate single-channel mixtures, where no source direction information is available. The idea is to use simple, low-level processing to separate sources in an unsupervised fashion, identify easy conditions, and then use that knowledge to bootstrap a (self-)supervised source separation model for difficult conditions. We also explore combining the two approaches in an ensemble.
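
The confidence-weighted training objective described above can be sketched as follows. This is a minimal illustration only, assuming a mask-based formulation with per-bin confidence scores in [0, 1]; the function name and array shapes are hypothetical, not the system's actual implementation.

```python
import numpy as np

def confidence_weighted_loss(pred_masks, est_masks, confidence):
    """Illustrative confidence-weighted loss (assumed formulation).

    pred_masks : masks predicted by the deep model (time x frequency).
    est_masks  : masks estimated by the unsupervised spatial algorithm,
                 serving as noisy training targets.
    confidence : per-bin scores in [0, 1]; bins the spatial algorithm
                 separated well contribute more to the loss.
    """
    sq_err = (pred_masks - est_masks) ** 2
    # Normalize by total confidence so well-separated mixtures dominate
    # without the loss scale depending on how many bins are trusted.
    return np.sum(confidence * sq_err) / np.sum(confidence)
```

With a toy 1x2 example where only the first bin is trusted, only that bin's error contributes: `confidence_weighted_loss(np.array([[0.5, 0.5]]), np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]]))` returns 0.25.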