In [6]:
from nussl import jupyter_utils, AudioSignal
from wand.image import Image as WImage

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:65% !important; font-size:1em;}</style>"))
In [2]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[2]:

Figure 2.2

The auditory scene starts with a pure sine tone, which is then joined by a stack of overtones. When the entire stack of sine tones starts to modulate slightly in frequency, the perception is that someone has started singing.
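
A stimulus along these lines can be synthesized directly. Below is a minimal sketch (not the exact stimulus used in the figure; the sample rate, fundamental, timing, and modulation depth/rate are all illustrative) that builds a harmonic stack with numpy and turns on a slight frequency modulation partway through.

import numpy as np

sr = 44100                          # assumed sample rate
t = np.arange(0, 6.0, 1 / sr)       # 6-second stimulus
f0 = 220.0                          # arbitrary fundamental frequency

# Slight frequency micromodulation: off for the first 3 s, ~1% deep at 5 Hz afterwards.
mod = 1 + 0.01 * np.sin(2 * np.pi * 5 * t) * (t > 3.0)

stimulus = np.zeros_like(t)
for k in range(1, 6):
    # Integrate the instantaneous frequency of the k-th partial to get its phase.
    phase = 2 * np.pi * np.cumsum(k * f0 * mod) / sr
    partial = np.cos(phase) / k
    if k > 1:
        partial = partial * (t > 2.0)   # overtones enter after 2 s; the fundamental plays throughout
    stimulus += partial

stimulus /= np.max(np.abs(stimulus))    # normalize to avoid clipping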

In [7]:
WImage(filename='../bregman_24.pdf')
Out[7]:
In [5]:
audio = [
    ('bregman24/demonstration.mp3', 'Demonstration of frequency micromodulation'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Demonstration of frequency micromodulation

Figure 2.5

Rising tones and falling tones can easily be isolated using the 2DFT. Each row contains the time-frequency representation and its associated 2DFT. When different quadrants are symmetrically masked out in the 2DFT, one can isolate either rising components or falling components in a mixture.
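
As a rough illustration of the quadrant-masking idea (a sketch, not the code behind this figure; the STFT parameters are arbitrary, and which pair of quadrants corresponds to rising versus falling depends on axis conventions):

import numpy as np
from scipy import signal

def split_by_direction(x, sr, n_fft=1024, hop=256):
    # STFT magnitude and phase of the mixture.
    _, _, stft = signal.stft(x, sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag, phase = np.abs(stft), np.angle(stft)

    # 2DFT of the magnitude spectrogram, DC shifted to the center.
    tf2d = np.fft.fftshift(np.fft.fft2(mag))
    rows, cols = tf2d.shape
    mask = np.zeros(tf2d.shape)
    mask[:rows // 2, :cols // 2] = 1    # keep one symmetric pair of quadrants...
    mask[rows // 2:, cols // 2:] = 1    # ...i.e. one direction of frequency movement

    one_dir = np.real(np.fft.ifft2(np.fft.ifftshift(tf2d * mask)))
    one_dir = np.clip(one_dir, 0, None)
    other_dir = np.clip(mag - one_dir, 0, None)

    # Reapply the mixture phase and invert both estimates.
    _, y1 = signal.istft(one_dir * np.exp(1j * phase), sr, nperseg=n_fft, noverlap=n_fft - hop)
    _, y2 = signal.istft(other_dir * np.exp(1j * phase), sr, nperseg=n_fft, noverlap=n_fft - hop)
    return y1, y2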

In [8]:
WImage(filename='../rising_vs_falling.pdf')
Out[8]:
In [9]:
audio = [
    ('rising_vs_falling/mix.mp3', 'Mixture of a rising tone and a falling tone'),
    ('rising_vs_falling/rising.mp3', 'Separation by 2DFT (falling tone)'),
    ('rising_vs_falling/falling.mp3', 'Separation by 2DFT (rising tone)'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Mixture of a rising tone and a falling tone
Separation by 2DFT (falling tone)
Separation by 2DFT (rising tone)

Figure 2.6

Separation using the 2D Fourier Transform (2DFT). In the first row, the left panel shows the mixture spectrogram and the right panel its 2DFT. In the second row, I apply my peak picking technique to obtain a background 2DFT. Then, I invert this 2DFT and apply masking to the mixture spectrogram to get the background spectrogram. In the third row, I show everything left in the 2DFT (i.e., the non-peaks, or foreground), which contains the singing voice.
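
The pipeline in this caption can be sketched compactly. The version below assumes a simple local-maximum criterion for peak picking (the thesis's actual criterion may differ) and operates on a magnitude spectrogram.

import numpy as np
from scipy import ndimage

def peak_pick_masks(mag_spectrogram, neighborhood=(15, 15)):
    tf2d = np.fft.fft2(mag_spectrogram)

    # Peaks: points that are the maximum of their local neighborhood in the 2DFT magnitude.
    mag2d = np.abs(tf2d)
    peaks = mag2d == ndimage.maximum_filter(mag2d, size=neighborhood)

    # Invert only the peaks (the background 2DFT) and turn the result into a soft mask.
    bg_est = np.clip(np.real(np.fft.ifft2(tf2d * peaks)), 0, None)
    bg_mask = np.clip(bg_est / np.maximum(mag_spectrogram, 1e-8), 0, 1)
    fg_mask = 1 - bg_mask               # the non-peaks / foreground, e.g. the singing voice
    return bg_mask, fg_mask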

In [11]:
WImage(filename='../september.pdf')
Out[11]:
In [12]:
audio = [
    ('september/mix.mp3', 'September - Earth, Wind, & Fire (mixture)'),
    ('september/bg.mp3', 'Background separated by 2DFT'),
    ('september/fg.mp3', 'Foreground separated by 2DFT'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
September - Earth, Wind, & Fire (mixture)
Background separated by 2DFT
Foreground separated by 2DFT

Figure 2.7

Response of the 2DFT separation algorithm to micromodulation. A pure tone is played first, then joined by its overtones. When the tone begins to modulate in frequency, the 2DFT algorithm recognizes it as a foreground source, showing that it is sensitive to micromodulation cues in audio mixtures.

In [13]:
WImage(filename='../micromodulation.pdf')
Out[13]:
In [14]:
audio = [
    ('micromodulation/mix.mp3', 'Demonstration of micromodulation'),
    ('micromodulation/response.mp3', 'Response of 2DFT to micromodulation'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Demonstration of micromodulation
Response of 2DFT to micromodulation

Figure 2.9

Response of the 2DFT separation algorithm to common frequency change. The tone is initially a complex tone with a fundamental and its overtones. It is perceived as a single sound. When half of its overtones begin to move, we perceive the mixture as containing two sounds. When the overtones stop moving, we once again perceive a single sound. The response of the 2DFT algorithm follows the same pattern.

In [15]:
WImage(filename='../common_frequency_change.pdf')
Out[15]:
In [16]:
audio = [
    ('common_fate/mix.mp3', 'Fusion based on common frequency change (top row)'),
    ('common_fate/bg.mp3', 'Background separated by 2DFT'),
    ('common_fate/fg.mp3', 'Foreground separated by 2DFT'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Fusion based on common frequency change (top row)
Background separated by 2DFT
Foreground separated by 2DFT

Figure 2.10

Response of the 2DFT separation algorithm to repetition. In the mixture, a repeating sound (shown on the right as the "target sound") is mixed with a non-repeating sound. The peak-picking procedure successfully isolates the repeating sound in the mixture, as can be seen by comparing the background spectrogram (middle row) with the target sound spectrogram.
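
In nussl, a 2DFT separation along these lines can be run in a few lines. This is a sketch assuming a nussl version that exposes the FT2D class at the top level with the usual run / make_audio_signals interface (the module path may differ across versions); the file paths are illustrative.

import nussl

mix = nussl.AudioSignal('repetition/mix.mp3')

ft2d = nussl.FT2D(mix)                 # 2DFT-based separation
ft2d.run()
background, foreground = ft2d.make_audio_signals()

background.write_audio_to_file('repeating_estimate.wav')
foreground.write_audio_to_file('non_repeating_estimate.wav')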

In [17]:
WImage(filename='../repetition.pdf')
Out[17]:
In [18]:
audio = [
    ('repetition/mix.mp3', 'Repetition stimuli'),
    ('repetition/repeating.mp3', 'Background separated by 2DFT'),
    ('repetition/non_repeating.mp3', 'Foreground separated by 2DFT'),
    ('repetition/stimuli.mp3', 'Target sound'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Repetition stimuli
Background separated by 2DFT
Foreground separated by 2DFT
Target sound

Figure 2.13

Pictured are the spectrogram of a musical mixture (above) and the ideal soft mask for separating the vocals from that mixture (below). The goal is to recover this soft mask as closely as possible. The mask has values between 0 (blue) and 1 (red).
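
For reference, an ideal soft mask of this kind is computed from the isolated sources, which are available in this oracle setting. A minimal sketch, assuming a magnitude ratio mask and scipy STFTs (the thesis may use a different mask definition or STFT parameters):

import numpy as np
from scipy import signal

def ideal_soft_mask(vocals, accompaniment, sr, n_fft=2048, hop=512):
    # Oracle ratio mask for the vocals; values lie in [0, 1] (0 = blue, 1 = red in the figure).
    _, _, V = signal.stft(vocals, sr, nperseg=n_fft, noverlap=n_fft - hop)
    _, _, A = signal.stft(accompaniment, sr, nperseg=n_fft, noverlap=n_fft - hop)
    return np.abs(V) / (np.abs(V) + np.abs(A) + 1e-8)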

In [19]:
WImage(filename='../primitive_clustering_mixture-crop.pdf')
Out[19]:
In [22]:
audio = [
    ('primitive_clustering/mixture.mp3', 'Mixture'),
    ('primitive_clustering/soft_mask.mp3', 'Ideal soft mask for vocals'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Mixture
Ideal soft mask for vocals

Figure 2.14

The process for making a primitive embedding space. A set of primitive algorithms is run on the mixture. Each algorithm produces a mask with values between 0 (blue) and 1 (red) that indicates how it is segmenting the auditory scene. The performance of each algorithm (measured by source-to-distortion ratio, or SDR) is shown. Together, the three masks map each time-frequency point to a 3D embedding space, shown on the right. The marked point was classified by the three primitives as melodic, not repetitive, and harmonic.
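
Building the embedding itself is mechanically simple. A sketch (variable names are illustrative): stack the three primitive masks so that each time-frequency point becomes a 3D vector whose coordinates are the three mask values.

import numpy as np

def primitive_embedding(repetition_mask, proximity_mask, harmonic_mask):
    # Each mask has shape (freq, time) with values in [0, 1].
    stacked = np.stack([repetition_mask, proximity_mask, harmonic_mask], axis=-1)
    return stacked.reshape(-1, 3)       # one 3D point per time-frequency bin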

In [20]:
WImage(filename='../primitive_clustering_overview-crop.pdf')
Out[20]:
In [23]:
audio = [
    ('primitive_clustering/2dft_fg.mp3', 'Mask using repetition'),
    ('primitive_clustering/prox_fg.mp3', 'Mask using time and pitch proximity'),
    ('primitive_clustering/hpss_fg.mp3', 'Mask using harmonic/percussive'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Mask using repetition
Mask using time and pitch proximity
Mask using harmonic/percussive

Figure 2.15

The primitive embedding space is clustered via K-Means to produce a new mask that takes into account the decisions of each primitive. The more the primitives agree on how to classify a time-frequency point, the stronger the corresponding value in the primitive clustering mask. The ensemble of primitives produces a mask that separates better (by SDR) than any single primitive on its own.
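
The clustering step could look roughly like the sketch below, using scikit-learn's KMeans; the soft assignment from distances to the cluster centers is an assumption and may differ from the exact formulation in the thesis.

import numpy as np
from sklearn.cluster import KMeans

def primitive_clustering_masks(embedding, spectrogram_shape, n_sources=2):
    km = KMeans(n_clusters=n_sources, n_init=10).fit(embedding)
    distances = km.transform(embedding)                  # distance of each TF point to each center
    weights = 1 / (distances + 1e-8)                     # closer to a center -> larger weight
    soft = weights / weights.sum(axis=1, keepdims=True)  # rows sum to 1: soft mask values
    return [soft[:, k].reshape(spectrogram_shape) for k in range(n_sources)]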

In [21]:
WImage(filename='../primitive_clustering_result-crop.pdf')
Out[21]:
In [24]:
audio = [
    ('primitive_clustering/pcl_bg.mp3', 'Mask via primitive clustering (background)'),
    ('primitive_clustering/pcl_fg.mp3', 'Mask via primitive clustering (foreground)'),
]
for a, l in audio:
    print(l)
    jupyter_utils.embed_audio(AudioSignal(a), display=True)
Mask via primitive clustering (background)
Mask via primitive clustering (foreground)

Compare the foreground from primitive clustering with the foreground via repetition above. Note that the snare has been noticeably suppressed in the primitive clustering estimate.