Finding ways to automatically index, label, and access multimedia content (such as audio documents) that lacks text tags indexing subportions of the audio is an open research question whose importance grows as multimedia repositories proliferate. SoundCloud is one example of such an online repository: it contains recordings of bands, sound effects, podcasts, and more, and contributors upload 12 hours of audio every minute. This audio is indexed by its authors at the time of contribution with short text labels that describe the full track, which may be minutes or hours long. Searching for a desired piece of audio using these labels can be problematic.

We propose to develop methods and a system for audio search via query-by-example, where the example is similar, in some way, to the desired audio in the database but is not an exact match. One type of query by example is vocal imitation. In this scenario, the query example may vary from the desired target along many dimensions. It is therefore important to find representations that emphasize the dimensions along which the query resembles the target sound. When the query is a vocal imitation, the query can lie in a very constrained sound space compared to the sounds to be retrieved, due to the physical constraints of the human vocal system. Building a successful system will therefore require research into representations of audio, methods for matching and retrieving audio based on queries that resemble target sounds only on a subset of their measurable dimensions, and interfaces that facilitate providing queries and refining search results.

The expected outcomes of this research are:
• Algorithms able to learn a good mid-level representation for audio that emphasizes the aspects of sound where queries and target sounds are most similar
• New interaction methods that allow search using vocal imitations and sound examples, with iterative search refinement
• An open-source sound retrieval system that embodies these outcomes
• A large dataset of vocal imitations and the sounds they imitate

The PIs at both institutions are well qualified to conduct this research and have strong publication records in audio source separation algorithms, machine learning for audio, audio interfaces that learn from user interaction, and measuring perceptual correlates of audio signals.

This research is of intellectual interest to a variety of fields beyond signal processing and human-computer interaction. Researchers in artificial intelligence (AI) are interested in systems that can parse and understand the perceptual world. Researchers in speech recognition, audio perception, and computational auditory scene analysis are especially interested in the ability to understand the audio scene. Algorithms for search and retrieval of audio are also of great interest to the multimedia information retrieval community.

The proposed work will have broader impact in any field that requires search through audio databases. These include biodiversity monitoring (e.g., automatic identification of bird species in field recordings of birdsong) and search through existing audio/video collections where it is currently impractical to hand-label the content of the data with searchable tags. Other application areas include diagnosis from audio example: callers to National Public Radio's well-known "Car Talk" show typically vocalize the sound of their ailing auto to help the hosts of the show diagnose the problem. One could imagine a database of typical auto sounds and problems that would let one search by vocal imitation or a field recording of the sound itself.
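To make the query-by-example idea concrete, here is a minimal sketch of the retrieval loop the abstract describes: each sound is mapped to a compact representation, and database entries are ranked by similarity to the query's representation. This is an illustrative stand-in only; the proposal's goal is to *learn* a mid-level representation that emphasizes the dimensions a vocal imitation shares with its target, whereas the hand-built `spectral_embedding` below (coarse log-spectral pooling via NumPy, with hypothetical parameter choices) is merely a fixed placeholder for such a representation.

```python
import numpy as np

def spectral_embedding(signal, n_fft=512, hop=256, n_bands=32):
    """Summarize a waveform as a coarse log-spectral vector.

    A hand-built stand-in for the learned mid-level representation
    described in the proposal; a real system would train this mapping.
    """
    # Short-time magnitude spectrogram via framing + windowed FFT.
    frames = [signal[i:i + n_fft] for i in range(0, len(signal) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=1))
    # Pool FFT bins into coarse bands, average over time, log-compress.
    bands = np.array_split(spec, n_bands, axis=1)
    return np.log1p(np.array([b.mean() for b in bands]))

def rank_by_similarity(query_vec, database):
    """Rank (name, vector) pairs by cosine similarity to the query."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return sorted(database, key=lambda item: cos(query_vec, item[1]),
                  reverse=True)
```

Coarse spectral pooling is deliberately forgiving: a slightly off-pitch imitation of a low tone still lands in the same band as its target, so it ranks the target above an unrelated high-frequency sound, which is the kind of partial-dimension match the proposal targets.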
Effective start/end date: 9/1/16 → 8/31/20
- National Science Foundation (IIS-1617497)