The MIT License (MIT)

Copyright (c) 2014 CNRS

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

AUTHORS
Hervé Bredin -- http://herve.niderb.fr

In [1]:
# setting up IPython Notebook so that it displays nicely
# you can safely skip this cell...
%pylab inline
pylab.rcParams['figure.figsize'] = (10.0, 5.0)
from pyannote.core.notebook import set_notebook_crop, Segment
set_notebook_crop(Segment(0, 60))

Populating the interactive namespace from numpy and matplotlib



# Aligning transcripts with subtitles

Start by loading The Big Bang Theory subset of the TVD dataset

In [2]:
from tvd import TheBigBangTheory
dataset = TheBigBangTheory('/Volumes/data/tvd/')


IN CASE YOU USE 'speaker' RESOURCES, PLEASE CONSIDER CITING:
@inproceedings{Tapaswi2012,
title = {{``Knock! Knock! Who is it?'' Probabilistic Person Identification in TV Series}},
author = {Makarand Tapaswi and Martin B\"{a}uml and Rainer Stiefelhagen},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2012},
month = {June},
}

IN CASE YOU USE 'outline' RESOURCES, PLEASE CONSIDER CITING:
@misc{the-big-bang-theory.com,
title = {{The Big Bang Theory Wiki}},
howpublished = \url{http://wiki.the-big-bang-theory.com/}
}

IN CASE YOU USE 'transcript' RESOURCES, PLEASE CONSIDER CITING:
@misc{bigbangtrans,
title = {{big bang theory transcripts}},
howpublished = \url{http://bigbangtrans.wordpress.com/}
}

IN CASE YOU USE 'transcript_www' RESOURCES, PLEASE CONSIDER CITING:
@misc{bigbangtrans,
title = {{big bang theory transcripts}},
howpublished = \url{http://bigbangtrans.wordpress.com/}
}

IN CASE YOU USE 'outline_www' RESOURCES, PLEASE CONSIDER CITING:
@misc{the-big-bang-theory.com,
title = {{The Big Bang Theory Wiki}},
howpublished = \url{http://wiki.the-big-bang-theory.com/}
}



For illustration purposes, we will only focus on the first episode of the TV series

In [3]:
firstEpisode = dataset.episodes[0]
print firstEpisode

TheBigBangTheory.Season01.Episode01



Luckily, for this episode, both the English transcript and the subtitles are available in the dataset.

In [4]:
transcript = dataset.get_resource('transcript', firstEpisode)


pyannote.parser provides a subtitle file (.srt) parser.
It also takes care of splitting (split=True) subtitles that cover dialogue lines from several speakers.
Additionally (duration=True), it allots each line a duration proportional to its number of words.
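The word-proportional duration heuristic can be sketched as follows. This is a toy illustration (the function name and details are my own, not the actual SRTParser implementation):

```python
def split_proportionally(start, end, lines):
    """Split the time span [start, end] among dialogue lines,
    proportionally to the number of words in each line."""
    counts = [len(line.split()) for line in lines]
    total = float(sum(counts))
    spans, t = [], start
    for line, count in zip(lines, counts):
        duration = (end - start) * count / total
        spans.append((t, t + duration, line))
        t += duration
    return spans

# a 4-second subtitle covering two speakers' lines
spans = split_proportionally(10.0, 14.0, ['- Hi.', '- Oh hi, come on in.'])
# the short first line gets 1 second, the longer second line gets 3
```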

In [5]:
from pyannote.parser.srt import SRTParser
parser = SRTParser(split=True, duration=True)
pathToSRTFile = dataset.path_to_subtitles(firstEpisode, language='en')
subtitles = parser.read(pathToSRTFile)


For illustration purposes, let us focus on the first minute of the episode.

In [6]:
subtitles = subtitles.crop(0, 60)
transcript = transcript.crop('C', 'T')


The transcript provides the exact transcription of the dialogues and characters' names.
However, it does not contain any timing information (other than the obvious chronological order).

On the other hand, subtitles only provide coarse dialogue transcription.
However, they do provide raw timing information.

In [8]:
subtitles

Out[8]:

Our objective here is to combine those two sources of information in order to get as close as possible to the actual reference annotation:

In [9]:
reference = dataset.get_resource('speaker', firstEpisode)
# 'speaker' resources contain many labels (e.g. music, applause, silence, etc.)
# we are only going to use the ones starting with 'speech_'
# (e.g. speech_sheldon, speech_penny, speech_raj, etc.)
reference = reference.subset(set([label for label in reference.labels() if label.startswith('speech')]))
# next, for better display below, we change each 'speech_somebody' label into 'SOMEBODY'
# (e.g. speech_sheldon becomes SHELDON)
reference = reference.translate({label: label[7:].upper() for label in reference.labels()})
# now, we just keep the first minute for illustration purposes
reference = reference.crop(Segment(0,60))
# this should display nicely in IPython Notebook
reference

Out[9]:

We are now going to use dynamic time warping to align the transcript with the subtitles based on the actual text.

In [10]:
from pyannote.features.text.preprocessing import TextPreProcessing
from pyannote.features.text.tfidf import TFIDF

# this is the default text (i.e. subtitles or speech transcript) pre-processing
preprocessing = TextPreProcessing(
tokenize=True,    # step 1: tokenization of sentences into words
lemmatize=True,   # step 2: lemmatization
stem=True,        # step 3: stemming
stopwords=True,   # step 4: remove stop-words
pos_tag=True,     # step 5: part-of-speech tagging...
keep_pos=True,    # ...keeping only nouns, adjectives, verbs and adverbs
min_length=2)     # step 6: remove stems shorter than 2 letters

# this is the TF-IDF transformer that will project
# each pre-processed text into a fixed-size vector
tfidf = TFIDF(preprocessing=preprocessing,  # use just-defined pre-processing
binary=True)                  # use binary term-frequency

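The effect of binary term frequency can be illustrated with a tiny self-contained TF-IDF computation (the documents, vocabulary and IDF variant below are made up for illustration; the actual TFIDF class may weight terms differently):

```python
import math

# three toy "documents" (already pre-processed into stems)
docs = [['photon', 'direct', 'plane', 'slit'],
        ['photon', 'photon', 'beam'],
        ['plane', 'beam', 'slit']]

vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# document frequency of each term
df = {w: sum(1 for doc in docs if w in doc) for w in vocabulary}

def tfidf_vector(doc, binary=True):
    # binary term frequency: 1 if the term occurs, regardless of count
    vec = []
    for w in vocabulary:
        tf = (1.0 if w in doc else 0.0) if binary else float(doc.count(w))
        idf = math.log(n_docs / float(df[w])) + 1.0
        vec.append(tf * idf)
    return vec

# 'photon' occurs twice in docs[1]: with binary=True its weight
# is the same as if it occurred once
v_binary = tfidf_vector(docs[1], binary=True)
v_counts = tfidf_vector(docs[1], binary=False)
```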

The actual alignment is done by an instance of TFIDFAlignment (more details on what it does below).

In [11]:
from pyannote.algorithms.alignment.transcription import TFIDFAlignment
aligner = TFIDFAlignment(tfidf, adapt=True)
merged = aligner(transcript, subtitles, vattribute='speech', hattribute='subtitle')


Bam! Merged!

In [12]:
merged

Out[12]:

We end up with a timestamped transcript which could, in turn, be used to train acoustic speaker identification models.
Let us quickly convert it into an Annotation object that can be easily visualized:

In [13]:
from pyannote.core import Annotation
hypothesis = Annotation()
for start_time, end_time, data in merged.edges_iter(data=True):
    if 'speaker' in data:
        hypothesis[Segment(start_time, end_time)] = data['speaker']
hypothesis

Out[13]:

See? We managed to obtain a speaker annotation not that far from the actual ground truth:

In [14]:
reference

Out[14]:

### Under the hood (TF-IDF + cosine distance + dynamic time warping)

In [15]:
from pyannote.algorithms.alignment.dtw import DynamicTimeWarping


#### TF-IDF

Under the hood, each dialogue line is first turned into a TF-IDF vector.

In [16]:
text = 'If a photon is directed through a plane with two slits in it...'
preprocessing(text)

Out[16]:
['photon', 'direct', 'plane', 'slit']
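The pre-processing steps can be roughly imitated in a few lines of self-contained Python. This is only a crude sketch (toy stop-word list, naive suffix stripping), not the real TextPreProcessing pipeline, which relies on proper lemmatization, stemming and part-of-speech tagging:

```python
import re

# toy stop-word list; the real pipeline uses a proper one,
# and drops words like 'two' via part-of-speech filtering
STOPWORDS = {'if', 'a', 'is', 'through', 'with', 'in', 'it', 'two'}

def toy_preprocess(text, min_length=2):
    # step 1: tokenization (lowercase words only)
    tokens = re.findall(r"[a-z]+", text.lower())
    # step 4: remove stop-words
    tokens = [t for t in tokens if t not in STOPWORDS]
    # steps 2-3: naive suffix stripping instead of real lemmatization/stemming
    stems = []
    for t in tokens:
        for suffix in ('ed', 's'):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[:-len(suffix)]
                break
        stems.append(t)
    # step 6: drop stems shorter than min_length letters
    return [s for s in stems if len(s) >= min_length]

toy_preprocess('If a photon is directed through a plane with two slits in it...')
# → ['photon', 'direct', 'plane', 'slit']
```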

In [17]:
vector = tfidf.transform([text])[0,:].toarray().squeeze()
print vector

[ 0.          0.          0.          0.          0.          0.52200489
0.          0.          0.          0.          0.          0.          0.
0.          0.          0.          0.          0.          0.          0.
0.          0.          0.          0.          0.          0.          0.
0.          0.52200489  0.          0.47698103  0.          0.          0.
0.          0.          0.          0.47698103  0.          0.          0.
0.          0.          0.        ]



vector has non-zero values only for the dimensions corresponding to its words.

In [18]:
for i, value in enumerate(vector):
    if value > 0:   # only print non-zero values
        print tfidf._cv.get_feature_names()[i], value

direct 0.522004885041
photon 0.522004885041
plane 0.476981026869
slit 0.476981026869



#### Cosine distance

Next, cosine distance between all pairs of transcript and subtitle lines is computed.
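The computation behind pairwise_distance can be sketched directly with numpy (a minimal illustration, not the aligner's actual implementation):

```python
import numpy as np

def cosine_distance_matrix(V, H):
    """Pairwise cosine distance between rows of V and rows of H."""
    V = np.asarray(V, dtype=float)
    H = np.asarray(H, dtype=float)
    # normalize each row to unit length
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    # cosine distance = 1 - cosine similarity
    return 1.0 - Vn.dot(Hn.T)

# two toy transcript vectors vs. two toy subtitle vectors
V = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0], [1.0, 1.0]]
D = cosine_distance_matrix(V, H)
# identical vectors are at distance 0, orthogonal ones at distance 1
```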

In [19]:
vsequence = aligner._get_sequence(transcript, 'speech')
hsequence = aligner._get_sequence(subtitles, 'subtitle')
distance = aligner.pairwise_distance(vsequence, hsequence)
im = imshow(distance, interpolation='nearest', aspect='auto'); xlabel('Subtitles'); ylabel('Transcript'); colorbar();


#### Dynamic time warping

Dynamic time warping ends the process by looking for the path with minimum overall cost (in white below).

In [20]:
dtw = DynamicTimeWarping()
path = dtw(range(len(vsequence)), range(len(hsequence)), distance=distance)

In [21]:
for i, j in path:
    distance[i, j] = np.NaN
imshow(distance, interpolation='nearest', aspect='auto'); xlabel('Subtitles'); ylabel('Transcript'); colorbar();
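For reference, the minimum-cost path search can be sketched with a minimal pure-Python dynamic time warping implementation (an illustration only, not the actual pyannote DynamicTimeWarping code):

```python
import numpy as np

def dtw_path(distance):
    """Minimum-cost monotonic path through a cost matrix."""
    n, m = distance.shape
    cost = np.full((n, m), np.inf)
    cost[0, 0] = distance[0, 0]
    # accumulate minimum cost, allowing vertical, horizontal and diagonal moves
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = np.inf
            if i > 0:
                best = min(best, cost[i - 1, j])
            if j > 0:
                best = min(best, cost[i, j - 1])
            if i > 0 and j > 0:
                best = min(best, cost[i - 1, j - 1])
            cost[i, j] = distance[i, j] + best
    # backtrack from the bottom-right corner along cheapest predecessors
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while (i, j) != (0, 0):
        candidates = []
        if i > 0:
            candidates.append((cost[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((cost[i, j - 1], (i, j - 1)))
        if i > 0 and j > 0:
            candidates.append((cost[i - 1, j - 1], (i - 1, j - 1)))
        _, (i, j) = min(candidates)
        path.append((i, j))
    return path[::-1]

# a toy cost matrix whose cheapest path is the diagonal
D = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
path = dtw_path(D)
# → [(0, 0), (1, 1), (2, 2)]
```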