The MIT License (MIT)

Copyright (c) 2014 CNRS

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

AUTHORS
Hervé Bredin -- http://herve.niderb.fr

# setting up IPython Notebook so that it displays nicely
# you can safely skip this cell...
%pylab inline
pylab.rcParams['figure.figsize'] = (10.0, 5.0)
from pyannote.core.notebook import set_notebook_crop, Segment
set_notebook_crop(Segment(0, 60))

Populating the interactive namespace from numpy and matplotlib

Aligning transcripts with subtitles

Start by loading The Big Bang Theory subset of the TVD dataset

from tvd import TheBigBangTheory
dataset = TheBigBangTheory('/Volumes/data/tvd/')


IN CASE YOU USE 'speaker' RESOURCES, PLEASE CONSIDER CITING:
@inproceedings{Tapaswi2012,
    title = {{``Knock! Knock! Who is it?'' Probabilistic Person Identification in TV Series}},
    author = {Makarand Tapaswi and Martin B\"{a}uml and Rainer Stiefelhagen},
    booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year = {2012},
    month = {June},
}


IN CASE YOU USE 'outline' RESOURCES, PLEASE CONSIDER CITING:
@misc{the-big-bang-theory.com,
    title = {{The Big Bang Theory Wiki}},
    howpublished = \url{http://wiki.the-big-bang-theory.com/}
}


IN CASE YOU USE 'transcript' RESOURCES, PLEASE CONSIDER CITING:
@misc{bigbangtrans,
    title = {{big bang theory transcripts}},
    howpublished = \url{http://bigbangtrans.wordpress.com/}
}


IN CASE YOU USE 'transcript_www' RESOURCES, PLEASE CONSIDER CITING:
@misc{bigbangtrans,
    title = {{big bang theory transcripts}},
    howpublished = \url{http://bigbangtrans.wordpress.com/}
}


IN CASE YOU USE 'outline_www' RESOURCES, PLEASE CONSIDER CITING:
@misc{the-big-bang-theory.com,
    title = {{The Big Bang Theory Wiki}},
    howpublished = \url{http://wiki.the-big-bang-theory.com/}
}

For illustration purposes, we will only focus on the first episode of the TV series

firstEpisode = dataset.episodes[0]
print firstEpisode

TheBigBangTheory.Season01.Episode01

Fortunately, both the English transcript and the subtitles are available in the dataset for this episode.

transcript = dataset.get_resource('transcript', firstEpisode)

pyannote.parser provides a parser for subtitle files (.srt).
With split=True, it also splits subtitles that cover dialogue lines from several speakers.
With duration=True, it additionally allots to each resulting line a duration proportional to its number of words.

from pyannote.parser.srt import SRTParser
parser = SRTParser(split=True, duration=True)
pathToSRTFile = dataset.path_to_subtitles(firstEpisode, language='en')
subtitles = parser.read(pathToSRTFile)
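
The proportional allotment described above can be pictured with a minimal sketch. It does not reproduce the internals of SRTParser: the helper name split_by_word_count, its signature, and the example dialogue lines are all hypothetical and only serve as illustration.

```python
def split_by_word_count(start, end, lines):
    """Split [start, end] into consecutive sub-intervals, one per line,
    each lasting a share of the total proportional to its word count.

    Hypothetical helper for illustration; not part of pyannote.parser.
    """
    counts = [len(line.split()) for line in lines]
    total = float(sum(counts))
    if total == 0:
        raise ValueError('lines contain no words')
    duration = end - start
    intervals = []
    cursor = start
    for line, count in zip(lines, counts):
        span = duration * count / total
        intervals.append((cursor, cursor + span, line))
        cursor += span
    return intervals


# a made-up 4-second subtitle covering two dialogue lines (2 and 5 words)
example = split_by_word_count(10.0, 14.0,
                              ['Hello there.', 'How are you doing today?'])
```

Here the first line receives 2/7 of the four seconds and the second line the remaining 5/7.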

For illustration purposes, let us focus on the first minute of the episode.

subtitles = subtitles.crop(0, 60)
transcript = transcript.crop('C', 'T')

The transcript provides the exact transcription of the dialogues and characters' names.
However, it does not contain any timing information (other than the obvious chronological order).

transcript

On the other hand, subtitles only provide coarse dialogue transcription.
However, they do provide raw timing information.

subtitles

Our objective here is to combine those two sources of information in order to get as close as possible to the actual reference annotation:

reference = dataset.get_resource('speaker', firstEpisode)
# 'speaker' resources contain plenty of labels (e.g. music, applause, silence, etc.)
# we are only going to use the ones starting with 'speech_'
# (e.g. speech_sheldon, speech_penny, speech_raj, etc.)
reference = reference.subset(set([label for label in reference.labels() if label.startswith('speech')]))
# next, for better display below, we change each 'speech_somebody' label into 'SOMEBODY'
# (e.g. speech_sheldon becomes SHELDON)
reference = reference.translate({label: label[7:].upper() for label in reference.labels()})
# now, we just keep the first minute for illustration purposes
reference = reference.crop(Segment(0,60))
# this should display nicely in IPython Notebook
reference

We are now going to use dynamic time warping to align the transcript with the subtitles based on the actual text.

from pyannote.features.text.preprocessing import TextPreProcessing
from pyannote.features.text.tfidf import TFIDF

# this is the default text (i.e. subtitles or speech transcript) pre-processing 
preprocessing = TextPreProcessing(
    tokenize=True,    # step 1: tokenization of sentences into words
    lemmatize=True,   # step 2: lemmatization
    stem=True,        # step 3: stemming
    stopwords=True,   # step 4: remove stop-words
    pos_tag=True,     
    keep_pos=True,    # step 5: only keep nouns, adjectives, verbs and adverbs
    min_length=2)     # step 6: remove stems shorter than 2 letters

# this is the TF-IDF transformer that will project 
# each pre-processed text into a fixed-size vector
tfidf = TFIDF(preprocessing=preprocessing,  # use just-defined pre-processing
              binary=True)                  # use binary term-frequency
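
To give a rough idea of what such a pipeline does, here is a deliberately simplified sketch covering only steps 1, 4 and 6 (tokenization, stop-word removal, minimum length). Lemmatization, stemming and part-of-speech filtering are left out because they need an NLP toolkit. The function simple_preprocess and its tiny stop-word list are hypothetical and are not how TextPreProcessing is implemented.

```python
import re

# tiny hand-picked stop-word list, for illustration only (an assumption,
# not the list used by pyannote)
STOPWORDS = {'a', 'an', 'the', 'is', 'are', 'if', 'in', 'it',
             'of', 'to', 'with', 'through'}


def simple_preprocess(text, min_length=2):
    """Hypothetical, simplified stand-in for a text pre-processing step."""
    # step 1: lowercase and tokenize into words
    tokens = re.findall(r"[a-z']+", text.lower())
    # step 4: remove stop-words
    tokens = [token for token in tokens if token not in STOPWORDS]
    # step 6: remove tokens shorter than min_length letters
    return [token for token in tokens if len(token) >= min_length]


words = simple_preprocess('The cat is in a box')
```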

The actual alignment is done by an instance of TFIDFAlignment (more details on what it does below).

from pyannote.algorithms.alignment.transcription import TFIDFAlignment
aligner = TFIDFAlignment(tfidf, adapt=True)
merged = aligner(transcript, subtitles, vattribute='speech', hattribute='subtitle')

Bam! Merged!

merged

We end up with a timestamped transcript which could, in turn, be used to train acoustic speaker identification models.
Let us quickly convert it into an Annotation object that can be easily visualized:

from pyannote.core import Annotation
hypothesis = Annotation()
for start_time, end_time, data in merged.edges_iter(data=True):
    if 'speaker' in data:
        hypothesis[Segment(start_time, end_time)] = data['speaker']
hypothesis

See? We managed to obtain a speaker annotation that is not far from the actual ground truth:

reference

Under the hood (TF-IDF + cosine distance + dynamic time warping)

from pyannote.algorithms.alignment.dtw import DynamicTimeWarping

TF-IDF

Under the hood, each dialogue line is first converted into a TF-IDF vector.

text = 'If a photon is directed through a plane with two slits in it...'
preprocessing(text)

['photon', 'direct', 'plane', 'slit']

vector = tfidf.transform([text])[0,:].toarray().squeeze()
print vector

[ 0.          0.          0.          0.          0.          0.52200489
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.52200489  0.          0.47698103  0.          0.          0.
  0.          0.          0.          0.47698103  0.          0.          0.
  0.          0.          0.        ]

The vector has non-zero values only in the dimensions corresponding to the words of the text.

for i, value in enumerate(vector):
    if value > 0:   # only print non-zero values
        print tfidf._cv.get_feature_names()[i], value

direct 0.522004885041
photon 0.522004885041
plane 0.476981026869
slit 0.476981026869
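
For readers who want to see where such weights come from, here is a from-scratch sketch of binary term frequency, IDF weighting and L2 normalization. The smoothed IDF formula ln((1 + n) / (1 + df)) + 1 is the default used by scikit-learn; that pyannote's TFIDF applies exactly this weighting is an assumption. The function tfidf_vectors and the three toy documents are hypothetical.

```python
import math


def tfidf_vectors(documents):
    """documents: list of lists of (pre-processed) terms.
    Returns the sorted vocabulary and one L2-normalized vector per document.
    Hypothetical helper for illustration only.
    """
    vocabulary = sorted(set(term for doc in documents for term in doc))
    n_docs = len(documents)
    # document frequency: number of documents containing each term
    df = {term: sum(1 for doc in documents if term in doc)
          for term in vocabulary}
    # smoothed inverse document frequency (assumed weighting)
    idf = {term: math.log((1.0 + n_docs) / (1.0 + df[term])) + 1.0
           for term in vocabulary}
    vectors = []
    for doc in documents:
        present = set(doc)
        # binary term frequency: a term counts once, however often it occurs
        weights = [idf[term] if term in present else 0.0
                   for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm > 0 else 0.0 for w in weights])
    return vocabulary, vectors


docs = [['photon', 'direct', 'plane', 'slit'],
        ['plane', 'slit', 'observe'],
        ['cat', 'box']]
vocabulary, vectors = tfidf_vectors(docs)
```

In this toy corpus, 'photon' appears in a single document while 'plane' appears in two, so 'photon' gets the larger weight in the first vector. This is the same pattern as in the output above, where 'direct' and 'photon' outweigh 'plane' and 'slit'.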

Cosine distance

Next, the cosine distance between every pair of transcript and subtitle lines is computed.

vsequence = aligner._get_sequence(transcript, 'speech')
hsequence = aligner._get_sequence(subtitles, 'subtitle')
distance = aligner.pairwise_distance(vsequence, hsequence)
im = imshow(distance, interpolation='nearest', aspect='auto'); xlabel('Subtitles'); ylabel('Transcript'); colorbar();
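
As an illustration of what such a distance matrix contains, the following NumPy sketch computes one minus the cosine similarity for every (row of V, row of H) pair. It is not the code behind aligner.pairwise_distance; the function name pairwise_cosine_distance and the toy vectors are hypothetical.

```python
import numpy as np


def pairwise_cosine_distance(V, H):
    """Return a (len(V), len(H)) matrix of cosine distances.
    Hypothetical helper for illustration only.
    """
    V = np.asarray(V, dtype=float)
    H = np.asarray(H, dtype=float)
    v_norm = np.linalg.norm(V, axis=1, keepdims=True)
    h_norm = np.linalg.norm(H, axis=1, keepdims=True)
    # avoid division by zero: all-zero vectors end up at distance 1
    v_norm[v_norm == 0] = 1.0
    h_norm[h_norm == 0] = 1.0
    similarity = np.dot(V / v_norm, (H / h_norm).T)
    return 1.0 - similarity


# toy vectors: 2 "transcript" lines against 3 "subtitle" lines
V = [[1, 0, 0], [0, 1, 1]]
H = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]
D = pairwise_cosine_distance(V, H)
```

Identical directions give a distance of 0 and vectors with no word in common give a distance of 1, which is why well-matched lines show up as low-valued cells in the plot.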

Dynamic time warping

Dynamic time warping ends the process by looking for the path with minimum overall cost (in white below).

dtw = DynamicTimeWarping()
path = dtw(range(len(vsequence)), range(len(hsequence)), distance=distance)

for i, j in path:
    distance[i, j] = np.nan
imshow(distance, interpolation='nearest', aspect='auto'); xlabel('Subtitles'); ylabel('Transcript'); colorbar();
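
The minimum-cost path search can be sketched with the textbook dynamic programming recursion. This is a generic version, assuming that the path runs from the top-left to the bottom-right cell and that only down, right and diagonal moves are allowed; the DynamicTimeWarping class may use different constraints. The function dtw_path and the toy matrix are hypothetical.

```python
import numpy as np


def dtw_path(distance):
    """Return (path, total_cost) of the cheapest monotonic path through
    a distance matrix. Hypothetical helper for illustration only.
    """
    distance = np.asarray(distance, dtype=float)
    n, m = distance.shape
    # cost[i, j] = cost of the cheapest path from (0, 0) to (i, j)
    cost = np.full((n, m), np.inf)
    cost[0, 0] = distance[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = np.inf
            if i > 0:
                best = min(best, cost[i - 1, j])
            if j > 0:
                best = min(best, cost[i, j - 1])
            if i > 0 and j > 0:
                best = min(best, cost[i - 1, j - 1])
            cost[i, j] = distance[i, j] + best
    # backtrack from the bottom-right cell to the top-left cell
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((cost[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path, cost[n - 1, m - 1]


# toy matrix: 4 "transcript" lines against 3 "subtitle" lines,
# where rows 1 and 2 both match column 1
toy = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 0, 1],
                [1, 1, 0]])
toy_path, toy_cost = dtw_path(toy)
```

On this toy matrix the path pairs two consecutive rows with the same column, which is how several transcript lines can end up aligned with a single subtitle.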