from nussl import jupyter_utils, AudioSignal
from wand.image import Image as WImage
import glob
import os
from IPython.display import display, HTML

display(HTML("<style>.container { width:65% !important; font-size:1em;}</style>"))
HTML('''<script>
code_show = true;
function code_toggle() {
    if (code_show) {
        $('div.input').hide();
    } else {
        $('div.input').show();
    }
    code_show = !code_show;
}
$(document).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Distribution of the confidence measure (i.e. predicted performance) and of source-to-distortion ratio (i.e. actual performance) for the direction-of-arrival separation algorithm, as applied to 5000 mixtures in the validation set. On the right, we can see that the algorithm actually fails about half the time (the peak at 0 dB SDR). We also see a corresponding peak at 0.0 confidence.
WImage(filename='../spatial_confidence_relationship.png')
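As a rough illustration of how the two histograms above could be produced, here is a minimal sketch assuming hypothetical arrays confidence and sdr holding the per-mixture confidence values and SDR scores; neither array is defined in this notebook, so random placeholders are used to make the cell run.
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical placeholders; in practice these would come from running the
# direction-of-arrival algorithm on the 5000 validation mixtures.
confidence = np.random.rand(5000)
sdr = np.random.randn(5000) * 5

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 4))
ax0.hist(confidence, bins=50)
ax0.set_xlabel('Confidence')
ax0.set_ylabel('Count')
ax1.hist(sdr, bins=50)
ax1.set_xlabel('SDR (dB)')
plt.show()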
folders = ['spcl_confidence/']
high_confidence_signals = []
low_confidence_signals = []

for folder in folders:
    audio_files = sorted(glob.glob(os.path.join(folder, '*.mp3')))
    for a in audio_files:
        # Build a human-readable label from the tokens in the file name.
        label = []
        if 'high' in a:
            label.append('High confidence')
        if 'low' in a:
            label.append('Low confidence')
        if 'mix' in a:
            label.append('Mixture')
        if '0' in a:
            label.append('Speaker 0')
        if '1' in a:
            label.append('Speaker 1')
        label = ' - '.join(label)
        if 'High' in label:
            high_confidence_signals.append((label, AudioSignal(a)))
        if 'Low' in label:
            low_confidence_signals.append((label, AudioSignal(a)))

# Group the (label, signal) pairs into chunks of n = 3
# (mixture, speaker 0, speaker 1) for display.
n = 3
high_confidence_signals = [high_confidence_signals[i:i+n] for i in range(0, len(high_confidence_signals), n)]
low_confidence_signals = [low_confidence_signals[i:i+n] for i in range(0, len(low_confidence_signals), n)]
for h in high_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')

for h in low_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')
An example of audio representations and mixing sources. In the top image, a spectrogram for a single audio source consisting of a piano playing 3 notes is shown. In the second, an example of speech. In the third image, a crumpling plastic cup. Finally, the last image shows the time-frequency representation of the three sources mixed together. It is very hard to disambiguate the three sources from one another in the image, especially in the times where the sources overlap in activity.

Relationship between confidence measure and actual performance for music mixtures, using primitive clustering to separate each mixture. There are $100$ points in each plot, representing $100$ separated sources that are either accompaniment (top row) or vocals (bottom row). We see a strong correlation between confidence and actual performance for accompaniment. We see a similar correlation for SDR and SIR for vocals, but no correlation for SAR. The blue line is the line of best fit found via linear regression. The p-values and r-values for each regression are overlaid on each scatter plot.
WImage(filename='../pcl_confidence_relationship.pdf')
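The fit lines and statistics in these scatter plots can be reproduced with scipy.stats.linregress, which reports both the r-value and a Wald-test p-value for the slope. Below is a minimal sketch, assuming hypothetical per-source confidence and sdr arrays (random placeholders here, not the actual results).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Hypothetical placeholders standing in for 100 separated sources.
confidence = np.random.rand(100)
sdr = 10 * confidence + np.random.randn(100)

fit = linregress(confidence, sdr)
plt.scatter(confidence, sdr, alpha=0.5)
xs = np.linspace(confidence.min(), confidence.max(), 2)
plt.plot(xs, fit.slope * xs + fit.intercept, color='blue')  # line of best fit
plt.title(f'r = {fit.rvalue:.2f}, p = {fit.pvalue:.2e}')
plt.xlabel('Confidence')
plt.ylabel('SDR (dB)')
plt.show()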
folders = ['pcl_confidence/pair1/', 'pcl_confidence/pair2/', 'pcl_confidence/pair3/']
high_confidence_signals = []
low_confidence_signals = []

for folder in folders:
    audio_files = sorted(glob.glob(os.path.join(folder, '*.mp3')))
    for a in audio_files:
        # Build a human-readable label from the tokens in the file name.
        label = []
        if 'high' in a:
            label.append('High confidence')
        if 'low' in a:
            label.append('Low confidence')
        if 'mix' in a:
            label.append('Mixture')
        if 'fg' in a:
            label.append('Foreground')
        if 'bg' in a:
            label.append('Background')
        label = ' - '.join(label)
        if 'High' in label:
            high_confidence_signals.append((label, AudioSignal(a)))
        if 'Low' in label:
            low_confidence_signals.append((label, AudioSignal(a)))

# Group into chunks of n = 3 (mixture, foreground, background) for display.
n = 3
high_confidence_signals = [high_confidence_signals[i:i+n] for i in range(0, len(high_confidence_signals), n)]
low_confidence_signals = [low_confidence_signals[i:i+n] for i in range(0, len(low_confidence_signals), n)]
These are high-confidence separations of a random selection of songs from YouTube (the dataset used in Chapter 4).
for h in high_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')
These are low-confidence separations from the same random selection of YouTube songs.
for h in low_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')
Relationship between the proposed confidence measure and three measures of actual performance (SDR, SIR, and SAR) for deep clustering. Each dot represents a separated source from the WSJ0 spatialized anechoic testing dataset ($N = 3000$). The blue line is the line of best fit found via linear regression. The p-values (measured via a Wald test \cite{gourieroux1982likelihood}) and r-values for each regression are overlaid on each scatter plot.
WImage(filename='../dpcl_confidence_relationship_speech.png')
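The SDR, SIR, and SAR values plotted above are standard BSS evaluation metrics. They are not computed in this notebook, but as a sketch, mir_eval can produce them given ground-truth and estimated sources; the arrays below are random placeholders, not real separations.
import numpy as np
import mir_eval

# Hypothetical placeholders: (n_sources, n_samples) arrays of ground-truth
# and separated audio. One second of noise at 44.1 kHz per source.
references = np.random.randn(2, 44100)
estimates = np.random.randn(2, 44100)

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(references, estimates)
print(f'SDR: {sdr}, SIR: {sir}, SAR: {sar}, permutation: {perm}')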
folders = ['dpcl_confidence/']
high_confidence_signals = []
low_confidence_signals = []

for folder in folders:
    audio_files = sorted(glob.glob(os.path.join(folder, '*.mp3')))
    for a in audio_files:
        # Build a human-readable label from the tokens in the file name.
        label = []
        if 'high' in a:
            label.append('High confidence')
        if 'low' in a:
            label.append('Low confidence')
        if 'mix' in a:
            label.append('Mixture')
        if '0' in a:
            label.append('Speaker 0')
        if '1' in a:
            label.append('Speaker 1')
        label = ' - '.join(label)
        if 'High' in label:
            high_confidence_signals.append((label, AudioSignal(a)))
        if 'Low' in label:
            low_confidence_signals.append((label, AudioSignal(a)))

# Group into chunks of n = 3 (mixture, speaker 0, speaker 1) for display.
n = 3
high_confidence_signals = [high_confidence_signals[i:i+n] for i in range(0, len(high_confidence_signals), n)]
low_confidence_signals = [low_confidence_signals[i:i+n] for i in range(0, len(low_confidence_signals), n)]
for h in high_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')

for h in low_confidence_signals:
    for l, a in h[::-1]:
        print(l)
        jupyter_utils.embed_audio(a, display=True)
    print('\n\n')