In [6]:
from nussl import jupyter_utils, AudioSignal
from wand.image import Image as WImage
import glob
import os

from IPython.display import display, HTML
display(HTML("<style>.container { width:65% !important; font-size:1em;}</style>"))

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[6]:

Figure 3.6

Distribution of the confidence measure (i.e., predicted performance) and the source-to-distortion ratio (SDR, i.e., actual performance) for the direction-of-arrival separation algorithm, applied to 5000 mixtures in the validation set. On the right, we can see that the algorithm actually fails about half the time (the peak at 0 dB SDR). On the left, we see a corresponding peak at 0.0 confidence.
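The kind of summary behind this figure can be sketched with NumPy. The arrays below are synthetic stand-ins for the real per-mixture scores (which come from the separation algorithm and BSS evaluation); only the histogramming step is the point:

```python
import numpy as np

# Hypothetical per-mixture scores standing in for the 5000 validation
# mixtures; real values come from the evaluation pipeline.
rng = np.random.default_rng(0)
confidence = np.clip(np.concatenate([
    rng.normal(0.05, 0.02, 2500),   # failure mode: confidence near 0.0
    rng.normal(0.60, 0.15, 2500),   # successful separations
]), 0, 1)
sdr = np.concatenate([
    rng.normal(0.0, 0.5, 2500),     # failure mode: SDR near 0 dB
    rng.normal(8.0, 2.0, 2500),
])

# Histogram each measure to expose the two peaks described above.
conf_counts, conf_edges = np.histogram(confidence, bins=50, range=(0, 1))
sdr_counts, sdr_edges = np.histogram(sdr, bins=50)
print(conf_counts.sum())  # 5000
```

Plotting `conf_counts` and `sdr_counts` side by side gives the two-panel layout of the figure.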

In [24]:
WImage(filename='../spatial_confidence_relationship.png')
Out[24]:
In [25]:
folders = ['spcl_confidence/']
high_confidence_signals = []
low_confidence_signals = []

for folder in folders:
    audio_files = sorted(glob.glob(f'{folder}/*.mp3'))
    for a in audio_files:
        # Build a human-readable label from cues in the filename
        label = []
        if 'high' in a:
            label.append('High confidence')
        if 'low' in a:
            label.append('Low confidence')
        if 'mix' in a:
            label.append('Mixture')
        if '0' in a:
            label.append('Speaker 0')
        if '1' in a:
            label.append('Speaker 1')
        label = ' - '.join(label)
        if 'High' in label:
            high_confidence_signals.append((label, AudioSignal(a)))
        if 'Low' in label:
            low_confidence_signals.append((label, AudioSignal(a)))

n = 3
high_confidence_signals = [high_confidence_signals[i:i+n] for i in range(0, len(high_confidence_signals), n)]
low_confidence_signals = [low_confidence_signals[i:i+n] for i in range(0, len(low_confidence_signals), n)]

Separation that spatial clustering was very confident about

In [26]:
for h in high_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')
High confidence - Mixture
High confidence - Speaker 1
High confidence - Speaker 0


Separation that spatial clustering was not confident about

In [27]:
for h in low_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')
Low confidence - Mixture
Low confidence - Speaker 1
Low confidence - Speaker 0


Figure 3.9

Relationship between confidence measure and actual performance for music mixtures, using primitive clustering to separate each mixture. There are $100$ points in each plot, representing $100$ separated sources that are either accompaniment (top row) or vocals (bottom row). We see a strong correlation between confidence and actual performance for accompaniment. We see a similar correlation for SDR and SIR for vocals, but no correlation for SAR. The blue line is the line of best fit found via linear regression. The p-values and r-values for each regression are overlaid on each scatter plot.

In [3]:
WImage(filename='../pcl_confidence_relationship.pdf')
Out[3]:
In [18]:
folders = ['pcl_confidence/pair1/', 'pcl_confidence/pair2/', 'pcl_confidence/pair3/']
high_confidence_signals = []
low_confidence_signals = []

for folder in folders:
    audio_files = sorted(glob.glob(f'{folder}/*.mp3'))
    for a in audio_files:
        # Build a human-readable label from cues in the filename
        label = []
        if 'high' in a:
            label.append('High confidence')
        if 'low' in a:
            label.append('Low confidence')
        if 'mix' in a:
            label.append('Mixture')
        if 'fg' in a:
            label.append('Foreground')
        if 'bg' in a:
            label.append('Background')
        label = ' - '.join(label)
        if 'High' in label:
            high_confidence_signals.append((label, AudioSignal(a)))
        if 'Low' in label:
            low_confidence_signals.append((label, AudioSignal(a)))

n = 3
high_confidence_signals = [high_confidence_signals[i:i+n] for i in range(0, len(high_confidence_signals), n)]
low_confidence_signals = [low_confidence_signals[i:i+n] for i in range(0, len(low_confidence_signals), n)]

Separations that primitive clustering was very confident about

These are separations of a random selection of songs from YouTube (the dataset used in Chapter 4).

In [22]:
for h in high_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')
High confidence - Mixture
High confidence - Foreground
High confidence - Background


High confidence - Mixture
High confidence - Foreground
High confidence - Background


High confidence - Mixture
High confidence - Foreground
High confidence - Background


Separations that primitive clustering was not confident about

These are separations of a random selection of songs from YouTube (the dataset used in Chapter 4).

In [21]:
for h in low_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')
Low confidence - Mixture
Low confidence - Foreground
Low confidence - Background


Low confidence - Mixture
Low confidence - Foreground
Low confidence - Background


Low confidence - Mixture
Low confidence - Foreground
Low confidence - Background


Figure 3.11

Relationship between the proposed confidence measure and three actual performance measures - SDR/SIR/SAR - for deep clustering. Each dot represents a separated source from the WSJ0 spatialized anechoic testing dataset ($N = 3000$). The blue line is the line of best fit found via linear regression. The p-values (measured via a Wald test \cite{gourieroux1982likelihood}) and r-values for each regression are overlaid on each scatter plot.
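The overlaid r-values and p-values can be reproduced with `scipy.stats.linregress`, whose slope test is the Wald test (with a t-distribution) cited above. The data below is synthetic, standing in for the real confidence/SDR pairs from the $N = 3000$ test sources:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical confidence and SDR values; in the experiment these come
# from the deep clustering model and BSS evaluation on WSJ0.
rng = np.random.default_rng(0)
confidence = rng.uniform(0, 1, 3000)
sdr = 10 * confidence + rng.normal(0, 1.0, 3000)  # assume a linear trend

fit = linregress(confidence, sdr)
# fit.rvalue and fit.pvalue are the numbers overlaid on each scatter
# plot; the line of best fit drawn in blue is:
best_fit_line = fit.intercept + fit.slope * confidence
```

`linregress` returns the slope, intercept, r-value, and the two-sided p-value for the null hypothesis that the slope is zero, which is exactly what each panel of the figure reports.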

In [29]:
WImage(filename='../dpcl_confidence_relationship_speech.png')
Out[29]:
In [30]:
folders = ['dpcl_confidence/']
high_confidence_signals = []
low_confidence_signals = []

for folder in folders:
    audio_files = sorted(glob.glob(f'{folder}/*.mp3'))
    for a in audio_files:
        # Build a human-readable label from cues in the filename
        label = []
        if 'high' in a:
            label.append('High confidence')
        if 'low' in a:
            label.append('Low confidence')
        if 'mix' in a:
            label.append('Mixture')
        if '0' in a:
            label.append('Speaker 0')
        if '1' in a:
            label.append('Speaker 1')
        label = ' - '.join(label)
        if 'High' in label:
            high_confidence_signals.append((label, AudioSignal(a)))
        if 'Low' in label:
            low_confidence_signals.append((label, AudioSignal(a)))

n = 3
high_confidence_signals = [high_confidence_signals[i:i+n] for i in range(0, len(high_confidence_signals), n)]
low_confidence_signals = [low_confidence_signals[i:i+n] for i in range(0, len(low_confidence_signals), n)]

Separation that deep clustering was very confident about

In [31]:
for h in high_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')
High confidence - Mixture
High confidence - Speaker 1
High confidence - Speaker 0


Separation that deep clustering was not confident about

In [32]:
for h in low_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')
Low confidence - Mixture
Low confidence - Speaker 1
Low confidence - Speaker 0