The MIT License (MIT)

Copyright (c) 2014 CNRS

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

AUTHORS
Hervé Bredin -- http://herve.niderb.fr

# setting up IPython Notebook so that it displays nicely
# you can safely skip this cell...
%pylab inline
pylab.rcParams['figure.figsize'] = (10.0, 5.0)
from pyannote.core.notebook import set_notebook_crop, Segment
set_notebook_crop(Segment(0, 30))

Populating the interactive namespace from numpy and matplotlib

Evaluation corpus

Experiments are performed on The Big Bang Theory subset of the TVD corpus.
Instructions to reproduce this corpus locally are provided on the TVD website.
The only prerequisite is that you acquire the DVDs of the first season.

from tvd import TheBigBangTheory
dataset = TheBigBangTheory('/Volumes/data/tvd/')


IN CASE YOU USE 'speaker' RESOURCES, PLEASE CONSIDER CITING:
@inproceedings{Tapaswi2012,
    title = {{``Knock! Knock! Who is it?'' Probabilistic Person Identification in TV Series}},
    author = {Makarand Tapaswi and Martin B\"{a}uml and Rainer Stiefelhagen},
    booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year = {2012},
    month = {June},
}


IN CASE YOU USE 'outline' RESOURCES, PLEASE CONSIDER CITING:
@misc{the-big-bang-theory.com,
    title = {{The Big Bang Theory Wiki}},
    howpublished = \url{http://wiki.the-big-bang-theory.com/}
}


IN CASE YOU USE 'transcript' RESOURCES, PLEASE CONSIDER CITING:
@misc{bigbangtrans,
    title = {{big bang theory transcripts}},
    howpublished = \url{http://bigbangtrans.wordpress.com/}
}


IN CASE YOU USE 'transcript_www' RESOURCES, PLEASE CONSIDER CITING:
@misc{bigbangtrans,
    title = {{big bang theory transcripts}},
    howpublished = \url{http://bigbangtrans.wordpress.com/}
}


IN CASE YOU USE 'outline_www' RESOURCES, PLEASE CONSIDER CITING:
@misc{the-big-bang-theory.com,
    title = {{The Big Bang Theory Wiki}},
    howpublished = \url{http://wiki.the-big-bang-theory.com/}
}

Experiments are conducted on the first six episodes of the first season of The Big Bang Theory TV series,
because manual speech turn annotations are available only for these episodes.

sixFirstEpisodes = dataset.episodes[:6]
for episode in sixFirstEpisodes:
    print(episode)

TheBigBangTheory.Season01.Episode01
TheBigBangTheory.Season01.Episode02
TheBigBangTheory.Season01.Episode03
TheBigBangTheory.Season01.Episode04
TheBigBangTheory.Season01.Episode05
TheBigBangTheory.Season01.Episode06

For illustration purposes, we focus only on the very first episode of the series.

episode = sixFirstEpisodes[0]
print(episode)

TheBigBangTheory.Season01.Episode01

Audio tracks

Once reproduced locally, the TVD corpus provides audio tracks for every episode, in every language available on the DVDs.

english = dataset.path_to_audio(episode, language='en')
french = dataset.path_to_audio(episode, language='fr')
print('English soundtrack: {}'.format(english))
print('French soundtrack: {}'.format(french))

English soundtrack: /Volumes/data/tvd//TheBigBangTheory/dvd/rip/audio/TheBigBangTheory.Season01.Episode01.en.wav
French soundtrack: /Volumes/data/tvd//TheBigBangTheory/dvd/rip/audio/TheBigBangTheory.Season01.Episode01.fr.wav

Subtitles

english = dataset.path_to_subtitles(episode, language='en')
print('English subtitles: {}'.format(english))

English subtitles: /Volumes/data/tvd//TheBigBangTheory/dvd/rip/subtitles/TheBigBangTheory.Season01.Episode01.en.srt

pyannote.parser provides an SRT parser that splits subtitles covering dialogue lines from several speakers into one entry per line (split=True), and allots to each line a duration proportional to its number of words (duration=True).

from pyannote.parser.srt import SRTParser
subtitles = SRTParser(split=True, duration=True).read(english)
subtitles
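
The duration heuristic can be illustrated without pyannote. The sketch below is not SRTParser's actual implementation: split_by_word_count is a hypothetical helper, and the subtitle text and timestamps are made up. It only shows one plausible way to divide a subtitle's time span among its dialogue lines in proportion to their word counts.

```python
def split_by_word_count(start, end, lines):
    """Divide the time span [start, end] among `lines`.

    Hypothetical helper for illustration only; not part of pyannote.parser.
    Each line receives a share of the duration proportional to its number
    of whitespace-separated words. Returns (line_start, line_end, line) tuples.
    """
    # count at least one word per line so that no line gets a zero duration
    counts = [max(len(line.split()), 1) for line in lines]
    total = float(sum(counts))
    duration = end - start

    segments = []
    cursor = start
    for line, count in zip(lines, counts):
        line_end = cursor + duration * count / total
        segments.append((cursor, line_end, line))
        cursor = line_end
    return segments


# a made-up 4-second subtitle covering two dialogue lines (1 word and 3 words)
toy_segments = split_by_word_count(10.0, 14.0, ["Hi.", "Hello there neighbor."])
for segment in toy_segments:
    print(segment)
```

With these inputs, the one-word line gets one quarter of the four seconds and the three-word line gets the remaining three quarters.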

Transcript

Subtitles provide a coarse dialogue transcription with timestamps. Transcripts, on the other hand, provide the exact dialogue transcription but no timing information other than the chronological order.

transcript = dataset.get_resource('transcript', episode)
transcript

Reference

The Big Bang Theory TVD plugin provides a resource called 'speaker' that contains a manual annotation of the audio track.

manual_annotation = dataset.get_resource('speaker', episode)
manual_annotation

Manual annotation provides all sorts of labels...

all_labels = manual_annotation.labels()
all_labels

['music_titlesong',
 'silence',
 'sound_laugh',
 'sound_laughclap',
 'sound_other',
 'speech_howard',
 'speech_leonard',
 'speech_other',
 'speech_overlapping',
 'speech_penny',
 'speech_raj',
 'speech_sheldon']

... but only speech regions are of interest to us:

labels = [label for label in all_labels if label.startswith('speech_')]
speech_regions = manual_annotation.subset(set(labels))
speech_regions

Among those speech regions, we focus on the five main characters.

speech_regions.labels()

['speech_howard',
 'speech_leonard',
 'speech_other',
 'speech_overlapping',
 'speech_penny',
 'speech_raj',
 'speech_sheldon']

All other speech turns, including overlapping speech, are marked as OTHER.

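main_characters = ['sheldon', 'leonard', 'penny', 'howard', 'raj']
# label[7:] strips the 'speech_' prefix (7 characters);
# any label that is not a main character (e.g. 'overlapping') maps to 'OTHER'
mapping = {label: label[7:].upper() if label[7:] in main_characters else 'OTHER'
           for label in labels}
mapping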

{'speech_howard': 'HOWARD',
 'speech_leonard': 'LEONARD',
 'speech_other': 'OTHER',
 'speech_overlapping': 'OTHER',
 'speech_penny': 'PENNY',
 'speech_raj': 'RAJ',
 'speech_sheldon': 'SHELDON'}

reference = speech_regions.translate(mapping)
reference
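
The combined effect of subset and translate can be mimicked on plain tuples. The sketch below is only an analogy: it does not use pyannote's annotation objects, to_reference is a hypothetical helper, and the segments are made up (only the label names come from the listing above).

```python
# made-up (start, end, label) tuples standing in for a manual annotation
toy_annotation = [
    (0.0, 2.5, 'music_titlesong'),
    (2.5, 4.0, 'speech_sheldon'),
    (4.0, 5.0, 'sound_laugh'),
    (5.0, 7.5, 'speech_leonard'),
    (7.5, 8.0, 'speech_overlapping'),
    (8.0, 9.0, 'speech_other'),
]

main_characters = ['sheldon', 'leonard', 'penny', 'howard', 'raj']
prefix = 'speech_'


def to_reference(annotation):
    """Keep speech segments only, then rename their labels.

    Hypothetical plain-Python analogy of subset() followed by translate().
    """
    reference = []
    for start, end, label in annotation:
        if not label.startswith(prefix):
            continue  # drop music, silence, laughter and other sounds
        name = label[len(prefix):]
        new_label = name.upper() if name in main_characters else 'OTHER'
        reference.append((start, end, new_label))
    return reference


toy_reference = to_reference(toy_annotation)
for segment in toy_reference:
    print(segment)
```

Non-speech segments disappear, main characters keep their (upper-cased) name, and both 'speech_overlapping' and 'speech_other' end up as OTHER.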

There you go: we now have our shiny speaker identification reference!