TY - JOUR
T1 - A human-in-the-loop system for sound event detection and annotation
AU - Kim, Bongjun
AU - Pardo, Bryan A
N1 - Funding Information:
The reviewing of this article was managed by special issue associate editors Marco Gillies and Rebecca Fiebrink. This work was supported by National Science Foundation Grant 1617497. Authors’ addresses: B. Kim, Northwestern University, 2133 Sheridan Road, Evanston, IL, 60208, USA; email: bongjun@u.northwestern.edu; B. Pardo, Northwestern University, 2133 Sheridan Road, Evanston, IL, 60208, USA; email: pardo@northwestern.edu.
Publisher Copyright:
© 2018 ACM
PY - 2018/7
Y1 - 2018/7
AB - Labeling of audio events is essential for many tasks. However, finding sound events and labeling them within a long audio file is tedious and time-consuming. In cases where there is very little labeled data (e.g., a single labeled example), it is often not feasible to train an automatic labeler, because many techniques (e.g., deep learning) require a large number of human-labeled training examples. Also, fully automated labeling may not show sufficient agreement with human labeling for many uses. To address this issue, we present a human-in-the-loop sound labeling system that helps a user quickly label target sound events in a long audio recording. It reduces the time required to label a long audio file (e.g., 20 hours) containing target sounds that are sparsely distributed throughout the recording (10% or less of the audio contains the target) when there are too few labeled examples (e.g., one) to train a state-of-the-art machine audio labeling system. To evaluate the effectiveness of our tool, we performed a human-subject study. The results show that it helped participants label target sound events twice as fast as labeling them manually. In addition to measuring the overall performance of the proposed system, we also measured interaction overhead and machine accuracy, two key factors that determine overall performance. The analysis shows that an ideal interface with no interaction overhead could speed labeling by as much as a factor of four.
KW - Human-in-the-loop system
KW - Interactive machine learning
KW - Sound event detection
UR - http://www.scopus.com/inward/record.url?scp=85068465689&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068465689&partnerID=8YFLogxK
DO - 10.1145/3214366
M3 - Article
AN - SCOPUS:85068465689
SN - 2160-6455
VL - 8
JO - ACM Transactions on Interactive Intelligent Systems
JF - ACM Transactions on Interactive Intelligent Systems
IS - 2
M1 - 13
ER -