In [2]:
from nussl import jupyter_utils, AudioSignal
In [41]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:65% !important; font-size:1em;}</style>"))
In [16]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[16]:

Figure 1.2

An example of audio representations. The top plot depicts the time-series representation, while the bottom plot depicts the spectrogram (time-frequency) representation. The auditory scene consists of a small jazz band (saxophone, piano, drums, and bass), chatter produced by a crowd of people, and dishes clattering as they are being washed. It is difficult to visually separate the three sources from one another in either the time-frequency representation or the original time series.

In [46]:
from wand.image import Image as WImage
img = WImage(filename='../mixture.png')
img
Out[46]:
In [47]:
audio = [
    ('mixture/mix2-261.wav', 'Mixture'),
]
for a, l in audio:
    print(l)
    s = AudioSignal(a)
    s.to_mono()
    jupyter_utils.embed_audio(s, display=True)
Mixture
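
A minimal sketch of how the two representations in Figure 1.2 can be computed from the mixture, assuming AudioSignal exposes stft() returning a complex time-frequency array; the plotting details are illustrative, not the exact code used to produce the figure.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

mix = AudioSignal('mixture/mix2-261.wav')
mix.to_mono()

stft = mix.stft()                          # complex STFT (freq x time x channels)
magnitude = np.abs(stft).squeeze()

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(10, 6))
ax_wave.plot(mix.audio_data.flatten())     # time-series representation (top)
ax_wave.set_title('Time series')
ax_spec.imshow(20 * np.log10(magnitude + 1e-7),
               origin='lower', aspect='auto')  # log-magnitude spectrogram (bottom)
ax_spec.set_title('Spectrogram')
plt.tight_layout()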

Figure 1.3

An example of time-frequency assignments in the spectrogram from Figure 1.2. The auditory scene consists of a small jazz band (saxophone, piano, drums, and bass), chatter produced by a crowd of people, and dishes clattering as they are being washed. Dark blue represents silence, light blue the jazz band, red the chatter, and yellow the dishes.

In [48]:
from wand.image import Image as WImage
img = WImage(filename='../separated.png')
img
Out[48]:
In [49]:
audio = [
    ('mixture/separated_music-268.wav', 'Separated music (light blue mask)'),
]
for a, l in audio:
    print(l)
    s = AudioSignal(a)
    s.to_mono()
    jupyter_utils.embed_audio(s, display=True)
Separated music (light blue mask)
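
A minimal sketch of the time-frequency masking idea behind Figure 1.3. The mask music_mask here is a hypothetical array with the same shape as the mixture STFT, with values near 1 for bins assigned to the jazz band (light blue) and near 0 elsewhere; make_copy_with_stft_data is assumed to be available in the installed version of nussl.

In [ ]:
import numpy as np

mix = AudioSignal('mixture/mix2-261.wav')
mix.to_mono()
stft = mix.stft()

music_mask = np.ones_like(stft, dtype=float)      # placeholder mask: keep every bin

# Element-wise masking keeps only the bins assigned to the music; inverting
# the masked STFT gives the separated waveform.
separated = mix.make_copy_with_stft_data(stft * music_mask)
separated.istft()
jupyter_utils.embed_audio(separated, display=True)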

Figure 1.4

An example of audio representations and mixing sources. The top image shows the spectrogram of a single audio source: a piano playing three notes. The second shows an example of speech, and the third a crumpling plastic cup. The last image shows the time-frequency representation of the three sources mixed together. It is very hard to disambiguate the three sources from one another in this image, especially at times when the sources overlap in activity.

In [44]:
from wand.image import Image as WImage
img = WImage(filename='../safety-dance-crop.pdf')
img
Out[44]:
In [45]:
audio = [
    ('safety_dance/mix Mixdown 1_L.wav', 'Mixture of cup, piano, voice'),
]
for a, l in audio:
    print(l)
    s = AudioSignal(a)
    s.to_mono()
    jupyter_utils.embed_audio(s, display=True)
Mixture of cup, piano, voice

Figure 1.5

Non-negative matrix factorization (NMF) applied to a mixture of two marimbas playing simultaneously. The two marimbas have different timbres. One is played with a soft mallet, resulting in a more harmonic tone, while the other uses a hard mallet, resulting in a more percussive tone. Applying NMF to the mixture and clustering the basis set by timbre recovers the two sources successfully.

In [4]:
from wand.image import Image as WImage
img = WImage(filename='../marimba_timbre.pdf')
img
Out[4]:
In [8]:
audio_files = ['marimba_timbre/a_mix.mp3', 'marimba_timbre/a_s0.mp3', 'marimba_timbre/a_s1.mp3']
audio_labels = ['Mixture: 2 Marimbas', 'NMF Source 1', 'NMF Source 2']

for a, l in zip(audio_files, audio_labels):
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Mixture: 2 Marimbas
NMF Source 1
NMF Source 2
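
A minimal sketch of the NMF-and-cluster approach described in Figure 1.5, using scikit-learn. The number of components and clusters are illustrative assumptions, and each source estimate is formed as a soft mask from its cluster's reconstruction.

In [ ]:
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

mix = AudioSignal('marimba_timbre/a_mix.mp3')
mix.to_mono()
V = np.abs(mix.stft()).squeeze()                   # magnitude spectrogram (freq x time)

nmf = NMF(n_components=20, max_iter=500)
W = nmf.fit_transform(V)                           # spectral templates (freq x components)
H = nmf.components_                                # activations (components x time)

# Group the templates by timbre: cluster the normalized spectral shapes.
shapes = (W / (W.sum(axis=0, keepdims=True) + 1e-8)).T
labels = KMeans(n_clusters=2, n_init=10).fit_predict(shapes)

# Reconstruct each cluster's part of the spectrogram and turn it into a mask.
masks = []
for k in range(2):
    Vk = W[:, labels == k] @ H[labels == k, :]
    masks.append(Vk / (W @ H + 1e-8))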

Figure 1.6

Non-negative matrix factorization (NMF) applied to a different mixture of two marimbas playing simultaneously. The two marimbas have the same timbre, but are separated in pitch. NMF fails to separate the two marimbas. The first source it discovers is merely the "attack" (the percussive start of each note), while the other source is the "sustain" (the harmonic, horizontal component of each note). Because the two marimbas have such similar spectral characteristics, the parts-based decomposition fails to distinguish them.

In [10]:
from wand.image import Image as WImage
img = WImage(filename='../marimba.pdf')
img
Out[10]:
In [11]:
audio_files = ['marimba/a_mix.mp3', 'marimba/a_s0.mp3', 'marimba/a_s1.mp3']
audio_labels = ['Mixture: 2 Marimbas', 'NMF Source 1', 'NMF Source 2']

for a, l in zip(audio_files, audio_labels):
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Mixture: 2 Marimbas
NMF Source 1
NMF Source 2

Figure 1.7

Output of primitive separation algorithms on different kinds of mixtures. In the top row, the constituent sources are shown. The goal is to extract the target sound, shown in the top right. When the target is mixed with repeating interference (first column), a repetition-based algorithm works. When the target is mixed with non-repetitive interference (second column), the algorithm fails. Finally, a direction-of-arrival-based algorithm (third column) works on the same non-repetitive mixture, as the two sources are spatially separated in the auditory scene.

In [12]:
from wand.image import Image as WImage
img = WImage(filename='../primitive_separation_edit.pdf')
img
Out[12]:
In [14]:
audio = [
    ('primitive_separation_edit/repeating_interference.mp3', 'Repetitive interfering source'),
    ('primitive_separation_edit/non_repeating_interference.mp3', 'Non-repetitive interfering source'),
    ('primitive_separation_edit/target_sound.mp3', 'Target sound'),
    ('primitive_separation_edit/repetitive_mixture.mp3', 'Repetitive mixture'),
    ('primitive_separation_edit/repetitive_separation.mp3', 'Separation via repetition (first column)'),
    ('primitive_separation_edit/non_repetitive_mixture.mp3', 'Non-repetitive mixture'),
    ('primitive_separation_edit/non_repetitve_separation.mp3', 'Separation via repetition (second column)'),
    ('primitive_separation_edit/doa_mixture.mp3', 'Non-repetitive stereo mixture'),
    ('primitive_separation_edit/separation_via_doa.mp3', 'Separation via direction of arrival'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Repetitive interfering source
Non-repetitive interfering source
Target sound
Repetitive mixture
Separation via repetition (first column)
Non-repetitive mixture
Separation via repetition (second column)
Non-repetitive stereo mixture
Separation via direction of arrival
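
A minimal sketch of the repetition-based separation in the first column of Figure 1.7, assuming nussl ships a REPET implementation with the run() / make_audio_signals() interface; the class path may differ between nussl versions.

In [ ]:
import nussl

mix = AudioSignal('primitive_separation_edit/repetitive_mixture.mp3')

repet = nussl.separation.Repet(mix)      # assumed location of the REPET class
repet.run()                              # model the repeating background
background, foreground = repet.make_audio_signals()

# The non-repeating foreground is the estimate of the target sound.
jupyter_utils.embed_audio(foreground, display=True)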

Figure 1.9

Output of a deep model on different mixtures. The model here is state-of-the-art and is trained to separate speech mixtures. In the top row are the ground truth sources used to create the mixtures in the second row. They are recordings of speech, whistling, and clapping. On the speech + speech mixture, the deep model performs well (first column), as it was trained for this task. However, on the other two mixtures, it fails to separate the scene into coherent sources, mixing up the whistle and speech (second column) and the whistle and clapping (third column).

In [17]:
from wand.image import Image as WImage
img = WImage(filename='../deep_failures.pdf')
img
Out[17]:
In [18]:
audio = [
    ('deep_failures/speech+speech.mp3', 'Speech + speech'),
    ('deep_failures/speech+speech_s0.mp3', 'Speech + speech - 1st recovered source'),
    ('deep_failures/speech+speech_s1.mp3', 'Speech + speech - 2nd recovered source'),
    
    ('deep_failures/speech+whistle.mp3', 'Speech + whistle'),
    ('deep_failures/speech+whistle_s0.mp3', 'Speech + whistle - 1st recovered source'),
    ('deep_failures/speech+whistle_s1.mp3', 'Speech + whistle - 2nd recovered source'),
    
    ('deep_failures/whistle+clap.mp3', 'Whistle + clap'),
    ('deep_failures/whistle+clap_s0.mp3', 'Whistle + clap - 1st recovered source'),
    ('deep_failures/whistle+clap_s1.mp3', 'Whistle + clap - 2nd recovered source'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Speech + speech
Speech + speech - 1st recovered source
Speech + speech - 2nd recovered source
Speech + whistle
Speech + whistle - 1st recovered source
Speech + whistle - 2nd recovered source
Whistle + clap
Whistle + clap - 1st recovered source
Whistle + clap - 2nd recovered source
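
A toy sketch of the masking-by-clustering idea behind models like the one in Figure 1.9: a trained network maps every time-frequency bin to an embedding, and clustering the embeddings yields one mask per source. No trained network is available here, so a trivial per-bin feature stands in for the learned embeddings, purely to illustrate the clustering and masking steps.

In [ ]:
import numpy as np
from sklearn.cluster import KMeans

mix = AudioSignal('deep_failures/speech+speech.mp3')
mix.to_mono()
stft = mix.stft().squeeze()                               # (freq x time), complex

# Stand-in for learned embeddings: one log-magnitude feature per bin.
features = np.log(np.abs(stft) + 1e-7).reshape(-1, 1)

labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)
masks = [(labels == k).reshape(stft.shape).astype(float) for k in range(2)]

# Applying each binary mask to the mixture STFT and inverting the result
# gives that source's estimate.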

Figure 1.10

An auditory scene (second panel from top) consisting of whistling, drums, and speech sounds. In the first part of the mixture, whistling and drums sound simultaneously, both spatially located to the left. In the second part, the whistling continues on the left while someone starts talking on the right. The top panel is a spectrogram of the whistling in isolation. The goal is to separate the whistle from the rest of the mixture.

In [20]:
from wand.image import Image as WImage
img = WImage(filename='../toy_example/mix_description-crop.pdf')
img
Out[20]:
In [21]:
audio = [
    ('mix_description/mix_extreme.m4a', 'Mixture'),
    ('mix_description/target_whistle.m4a', 'Target (whistling)'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Mixture
Target (whistling)

Figure 1.11

Separating the audio mixture using direction of arrival. The top panel shows the output of the algorithm. The second panel shows the mixture. The third panel contains a visualization of the spatial features. In the first part of the mixture, the algorithm fails, as sounds are coming from only one location. In the second part, the algorithm succeeds.

In [22]:
from wand.image import Image as WImage
img = WImage(filename='../toy_example/spatial_over_time-crop.pdf')
img
Out[22]:
In [23]:
audio = [
    ('spatial_over_time/easy_doa.wav', 'Separation via direction of arrival'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Separation via direction of arrival
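
A minimal sketch of direction-of-arrival-based masking, as in Figure 1.11, assuming the mixture file is stereo. The inter-channel phase difference of each time-frequency bin reflects the direction its energy arrived from, so bins can be kept or discarded based on it; the threshold below is an illustrative assumption, not a tuned value.

In [ ]:
import numpy as np

mix = AudioSignal('mix_description/mix_extreme.m4a')      # stereo mixture
stft = mix.stft()                                         # (freq x time x 2 channels)
left, right = stft[..., 0], stft[..., 1]

# Phase difference between the channels for every time-frequency bin.
ipd = np.angle(left * np.conj(right))

# Keep bins whose phase difference falls in a narrow band around one direction.
mask = (np.abs(ipd) < 0.2).astype(float)
estimate_stft = left * mask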

Figure 1.12

Output of primitive separation algorithms on the mixture. Each primitive has made a mistake on some part of the mixture, but the combination of the primitives (fourth row) has succeeded in isolating the whistle over the course of the entire mixture, even as the conditions of the mixture change.

In [24]:
from wand.image import Image as WImage
img = WImage(filename='../toy_example/primitive_ensemble-crop.pdf')
img
Out[24]:
In [25]:
audio = [
    ('primitive_ensemble/easy_modulation.wav', 'Primitive: micromodulation'),
    ('primitive_ensemble/easy_doa.wav', 'Primitive: direction of arrival'),
    ('primitive_ensemble/easy_proximity.wav', 'Primitive: time and pitch proximity'),
    ('primitive_ensemble/easy_ensemble_1.wav', 'Ensemble of primitives'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Primitive: micromodulation
Primitive: direction of arrival
Primitive: time and pitch proximity
Ensemble of primitives
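
A minimal sketch of the ensemble in Figure 1.12: each primitive produces a soft mask over the mixture spectrogram, and the ensemble averages them, so a bin survives only when several primitives agree. The three masks below are hypothetical placeholders standing in for the outputs of the micromodulation, direction-of-arrival, and proximity primitives.

In [ ]:
import numpy as np

shape = (513, 400)                                 # placeholder STFT shape
modulation_mask = np.random.rand(*shape)           # placeholder primitive outputs
doa_mask = np.random.rand(*shape)
proximity_mask = np.random.rand(*shape)

ensemble_mask = np.mean([modulation_mask, doa_mask, proximity_mask], axis=0)

# Optionally sharpen the ensemble toward a binary decision.
binary_ensemble = (ensemble_mask > 0.5).astype(float)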

Figure 1.14

Bootstrapping a deep model from primitives. Instead of being trained with ground truth, the model is trained on the output of primitive source separation algorithms.

In [26]:
from wand.image import Image as WImage
img = WImage(filename='../toy_example/bootstrapping_process-crop.pdf')
img
Out[26]:
In [27]:
audio = [
    ('bootstrapping_process/easy_ensemble_1.wav', 'Ensemble of primitives'),
    ('bootstrapping_process/mix_extreme.m4a', 'Mixture'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Ensemble of primitives
Mixture
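
A minimal sketch of how training data could be assembled in the bootstrapping setup of Figure 1.14: the ensemble mask stands in for ground truth, so a training example pairs mixture features with primitive-derived targets rather than true sources. The confidence weighting is an illustrative assumption: bins where the primitives agree strongly count more, uncertain bins count less.

In [ ]:
import numpy as np

mix = AudioSignal('bootstrapping_process/mix_extreme.m4a')
mix.to_mono()
mix_features = np.log(np.abs(mix.stft()).squeeze() + 1e-7)

ensemble_mask = np.random.rand(*mix_features.shape)       # placeholder ensemble output

targets = (ensemble_mask > 0.5).astype(float)             # pseudo-labels, not ground truth
confidence = np.abs(ensemble_mask - 0.5) * 2              # 0 = uncertain, 1 = confident

training_example = {
    'features': mix_features,      # network input
    'targets': targets,            # primitive-derived labels
    'weights': confidence,         # down-weight bins the primitives disagree on
}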

Figure 1.15

An auditory scene consisting of whistling, drums, speech, and singing sounds. The first part of the mixture is similar to the first part of the mixture in Figure 1.10. However, in the second part, someone starts singing from the same location as the whistling, causing all of the primitives to fail. As a result, the ensemble of primitives (the top panel) fails to isolate the whistling sound.

In [28]:
from wand.image import Image as WImage
img = WImage(filename='../toy_example/harder_mix_description-crop.pdf')
img
Out[28]:
In [29]:
audio = [
    ('harder_mix_description/harder_ensemble_1.wav', 'Ensemble of primitives'),
    ('harder_mix_description/other_target_whistle.m4a', 'New target (whistling)'),
    ('harder_mix_description/mix_with_singer_diff_whistle.m4a', 'New mixture'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Ensemble of primitives
New target (whistling)
New mixture

Figure 1.16

The output of the bootstrapped deep learning model on the new mixture. Unlike the ensemble of primitives it was trained from, the model is able to isolate the whistle in the second part of the mixture, where the primitives fail.

In [30]:
from wand.image import Image as WImage
img = WImage(filename='../toy_example/output_of_bootstrapped_model-crop.pdf')
img
Out[30]:
In [31]:
audio = [
    ('bootstrapping_output/mix_with_singer_diff_whistle.m4a', 'New mixture'),
    ('bootstrapping_output/other_target_whistle.m4a', 'New target (whistling)'),
    ('bootstrapping_output/harder_ensemble_1.wav', 'Ensemble of primitives'),
    ('bootstrapping_output/harder_dpcl_1.wav', 'Output of bootstrapped deep model'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
New mixture
New target (whistling)
Ensemble of primitives
Output of bootstrapped deep model