An example of audio representations. The top plot depicts the time-series representation, while the bottom plot depicts the spectrogram (time-frequency) representation. The auditory scene consists of a small jazz band (saxophone, piano, drums, and bass), chatter produced by a crowd of people, and dishes clattering as they are being washed. It is difficult to visually separate the three sources from one another in either the time-frequency representation or the original time-series.
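The two representations in this figure are related by the short-time Fourier transform (STFT). A minimal sketch, assuming a mono recording sampled at `sr` Hz (a synthetic sine tone stands in for real audio here):

```python
# Minimal sketch: the two representations in the figure above.
import numpy as np
from scipy import signal

sr = 16000
t = np.arange(2 * sr) / sr                    # 2 seconds of samples
x = 0.5 * np.sin(2 * np.pi * 440 * t)         # time-series representation

# Time-frequency representation: magnitude of the short-time Fourier transform.
freqs, frames, Zxx = signal.stft(x, fs=sr, nperseg=1024, noverlap=768)
spectrogram = np.abs(Zxx)                     # shape: (frequencies, time frames)
```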
An example of time-frequency assignments in the spectrogram from Figure 1.2. The auditory scene consists of a small jazz band (saxophone, piano, drums, and bass), chatter produced by a crowd of people, and dishes clattering as they are being washed. Dark blue represents silence, light blue the jazz band, red the chatter, and yellow the dishes.
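An assignment map like the one in this figure can be computed when the isolated sources are known: each time-frequency bin is labeled with whichever source has the most energy there. A minimal sketch, assuming `band`, `chatter`, and `dishes` are magnitude spectrograms of the isolated sources (random stand-ins here):

```python
# Sketch of time-frequency assignment: label each bin by the loudest source.
import numpy as np

rng = np.random.default_rng(0)
band, chatter, dishes = (rng.random((513, 100)) for _ in range(3))  # stand-ins

stacked = np.stack([band, chatter, dishes])   # (3, frequencies, time frames)
assignment = stacked.argmax(axis=0)           # 0 = band, 1 = chatter, 2 = dishes
silence = stacked.sum(axis=0) < 1e-3          # bins where no source is active
```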
An example of audio representations and mixing sources. The top image shows the spectrogram of a single audio source: a piano playing three notes. The second shows the spectrogram of speech, and the third that of a crumpling plastic cup. The last image shows the time-frequency representation of the three sources mixed together. It is very hard to disambiguate the three sources from one another in this image, especially at times where the sources overlap in activity.
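The mixing referred to in this figure is sample-wise addition in the time domain. A minimal sketch, with synthetic stand-ins for the piano, speech, and cup signals:

```python
# Sketch of the mixing model: sources add sample by sample in the time domain
# (and their complex STFTs add as well), which is why overlapping activity is
# hard to pull apart visually in the mixture spectrogram.
import numpy as np

sr = 16000
t = np.arange(sr) / sr
piano = 0.3 * np.sin(2 * np.pi * 262 * t)     # stand-in sources
speech = 0.3 * np.sin(2 * np.pi * 180 * t) * np.sin(2 * np.pi * 3 * t)
cup = 0.1 * np.random.default_rng(0).standard_normal(sr)

mixture = piano + speech + cup
```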
Non-negative matrix factorization (NMF) applied to a mixture of two marimbas playing simultaneously. The two marimbas have different timbres. One is played with a soft mallet, resulting in a more harmonic tone, while the other uses a hard mallet, resulting in a more percussive tone. Applying NMF to the mixture and clustering the basis set by timbre recovers the two sources successfully.
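A minimal sketch of the approach in this figure, assuming `V` is the magnitude spectrogram of the mixture (a random stand-in here): NMF factors it into spectral templates `W` and activations `H`. The two-way split of templates by spectral centroid below is a hypothetical stand-in for the timbre clustering step:

```python
# Sketch of NMF-based separation: factor the mixture spectrogram, group the
# learned templates by timbre, and reconstruct one source per group.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((513, 200))                    # stand-in mixture magnitude spectrogram

model = NMF(n_components=8, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)                    # spectral templates, one per column
H = model.components_                         # per-template activations over time

# Hypothetical timbre clustering: split templates by spectral centroid.
freqs = np.arange(V.shape[0])
centroids = (W * freqs[:, None]).sum(axis=0) / (W.sum(axis=0) + 1e-8)
hard_mallet = centroids > np.median(centroids)
soft_marimba = W[:, ~hard_mallet] @ H[~hard_mallet]
hard_marimba = W[:, hard_mallet] @ H[hard_mallet]
```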
Non-negative matrix factorization (NMF) applied to a different mixture of two marimbas playing simultaneously. The two marimbas have the same timbre, but are separated in pitch. NMF fails to separate the two marimbas: the first source it discovers is merely the "attack" (the percussive start of each note), while the other is the "sustain" (the harmonic, horizontal component of each note). Because the two sources have similar spectral characteristics, the parts-based decomposition approach fails.
Output of primitive separation algorithms on different kinds of mixtures. In the top row, the constituent sources are shown. The goal is to extract the target sound, shown in the top right. When the target is mixed with repeating interference (first column), a repetition-based algorithm works. When the target is mixed with non-repetitive interference (second column), the algorithm fails. Finally, a direction-of-arrival-based algorithm (third column) works on the same non-repetitive mixture, as the two sources are spatially separated in the auditory scene.
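A minimal sketch of a repetition-based primitive in the spirit of the first algorithm in this figure, assuming `V` is the mixture magnitude spectrogram (a random stand-in here): the repeating background is estimated per frequency bin as the median over time, and a soft mask keeps whatever rises above it:

```python
# Sketch of a repetition-based primitive: model the repeating interference as
# the per-frequency median over time, then mask it away to expose the target.
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((513, 400))                         # stand-in mixture magnitude spectrogram

background = np.median(V, axis=1, keepdims=True)   # estimate of repeating structure
foreground_mask = np.clip((V - background) / (V + 1e-8), 0.0, 1.0)
target_estimate = foreground_mask * V              # the non-repeating target
```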
Output of a deep model on different mixtures. The model here is state-of-the-art and is trained to separate speech mixtures. In the top row are the ground truth sources used to create the mixtures in the second row. They are recordings of speech, whistling, and clapping. On the speech + speech recording, the deep model performs well (first column), as it was trained for this task. However, on the other two mixtures, it fails to separate the scene into coherent sources, mixing up the whistle and speech (second column) and the whistle and clapping (third column).
An auditory scene (second panel from top) consisting of whistling, drums, and speech sounds. In the first part of the mixture, whistling and drums sound simultaneously, both spatially located to the left. In the second part, the whistling continues on the left while someone starts talking on the right. The top panel is a spectrogram of the whistling in isolation. The goal is to separate the whistle from the rest of the mixture.
Separating the audio mixture using direction of arrival. The top panel shows the output of the algorithm, the second panel the mixture, and the third panel a visualization of the spatial features. In the first part of the mixture, the algorithm fails, as both sounds come from the same location. In the second part, it succeeds.
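A minimal sketch of such a direction-of-arrival primitive for a stereo mixture, assuming `left` and `right` are the complex STFTs of the two channels (random stand-ins here) and that the target's direction corresponds to an interchannel phase difference near `target_ipd` (a hypothetical value; phase wrap-around is ignored for brevity):

```python
# Sketch of a direction-of-arrival primitive: keep the time-frequency bins
# whose interchannel phase difference points toward the target's location.
import numpy as np

rng = np.random.default_rng(0)
shape = (513, 200)
left = rng.random(shape) * np.exp(1j * rng.uniform(-np.pi, np.pi, shape))
right = rng.random(shape) * np.exp(1j * rng.uniform(-np.pi, np.pi, shape))

ipd = np.angle(left * np.conj(right))              # interchannel phase difference
target_ipd = 0.8                                   # hypothetical direction of the whistle
doa_mask = (np.abs(ipd - target_ipd) < 0.5).astype(float)
whistle_estimate = doa_mask * left                 # masked STFT of one channel
```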
Output of primitive separation algorithms on the mixture. Each primitive has made a mistake on some part of the mixture, but the combination of the primitives (fourth row) succeeds in isolating the whistle over the course of the entire mixture, even as the conditions of the mixture change.
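One simple way to combine primitives is to average the soft masks each produces over the mixture spectrogram; the averaging rule below is an illustrative choice, not necessarily the combination used in this figure:

```python
# Sketch of an ensemble of primitives: average the soft masks produced by the
# repetition and direction-of-arrival primitives into one combined mask.
import numpy as np

rng = np.random.default_rng(0)
repetition_mask, doa_mask = rng.random((2, 513, 200))   # stand-in primitive masks

ensemble_mask = (repetition_mask + doa_mask) / 2        # one possible combination rule
```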
Bootstrapping a deep model from primitives. Instead of being trained with ground truth, the model is trained on the output of primitive source separation algorithms.
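A minimal sketch of this bootstrapping idea in PyTorch: the mask produced by the primitive ensemble stands in for ground truth when training a separation network. The tiny network, the mean-squared-error loss, and all shapes below are illustrative stand-ins, not the model in this figure:

```python
# Sketch of bootstrapping: train a network to predict the primitives' mask.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(513, 256), nn.ReLU(),
                    nn.Linear(256, 513), nn.Sigmoid())
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

mixture = torch.rand(200, 513)            # frames of a mixture magnitude spectrogram
primitive_mask = torch.rand(200, 513)     # pseudo-label: the primitive ensemble's mask

for _ in range(100):
    predicted_mask = net(mixture)
    loss = nn.functional.mse_loss(predicted_mask, primitive_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```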
An auditory scene consisting of whistling, drums, speech, and singing sounds. The first part of the mixture mirrors the setup of the first part of the mixture in Figure 1.10. However, in the second part, someone starts singing from the same location as the whistling, causing all of the primitives to fail. As a result, the ensemble of primitives (top panel) fails to isolate the whistling sound.
The output of the bootstrapped deep learning model on the new mixture. Unlike the ensemble of primitives from which it was trained, it is able to isolate the whistle in the second part of the mixture, where the primitives fail.