The MIT License (MIT)

Copyright (c) 2014 CNRS

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

AUTHORS
Hervé Bredin -- http://herve.niderb.fr
In [1]:
# setting up IPython Notebook so that it displays nicely
# you can safely skip this cell...
%pylab inline
pylab.rcParams['figure.figsize'] = (10.0, 5.0)
from pyannote.core.notebook import set_notebook_crop, Segment
set_notebook_crop(Segment(0, 60))
Populating the interactive namespace from numpy and matplotlib

Aligning transcripts with subtitles

Start by loading The Big Bang Theory subset of the TVD dataset.

In [2]:
from tvd import TheBigBangTheory
dataset = TheBigBangTheory('/Volumes/data/tvd/')

IN CASE YOU USE 'speaker' RESOURCES, PLEASE CONSIDER CITING:
@inproceedings{Tapaswi2012,
    title = {{``Knock! Knock! Who is it?'' Probabilistic Person Identification in TV Series}},
    author = {Makarand Tapaswi and Martin B\"{a}uml and Rainer Stiefelhagen},
    booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year = {2012},
    month = {June},
}


IN CASE YOU USE 'outline' RESOURCES, PLEASE CONSIDER CITING:
@misc{the-big-bang-theory.com,
    title = {{The Big Bang Theory Wiki}},
    howpublished = \url{http://wiki.the-big-bang-theory.com/}
}


IN CASE YOU USE 'transcript' RESOURCES, PLEASE CONSIDER CITING:
@misc{bigbangtrans,
    title = {{big bang theory transcripts}},
    howpublished = \url{http://bigbangtrans.wordpress.com/}
}


IN CASE YOU USE 'transcript_www' RESOURCES, PLEASE CONSIDER CITING:
@misc{bigbangtrans,
    title = {{big bang theory transcripts}},
    howpublished = \url{http://bigbangtrans.wordpress.com/}
}


IN CASE YOU USE 'outline_www' RESOURCES, PLEASE CONSIDER CITING:
@misc{the-big-bang-theory.com,
    title = {{The Big Bang Theory Wiki}},
    howpublished = \url{http://wiki.the-big-bang-theory.com/}
}


For illustration purposes, we will focus only on the first episode of the TV series.

In [3]:
firstEpisode = dataset.episodes[0]
print firstEpisode
TheBigBangTheory.Season01.Episode01

Conveniently, for this episode, both the English transcript and the subtitles are available in the dataset.

In [4]:
transcript = dataset.get_resource('transcript', firstEpisode)

pyannote.parser provides a subtitle file (.srt) parser.
It also takes care of splitting (split=True) subtitles that cover multiple dialogue lines from several speakers.
Additionally (duration=True), it allots to each line a duration proportional to its number of words.
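
As a sanity check, the duration=True behavior can be reproduced by hand. In the first minute of this episode (see the subtitle graph below), the block spanning 48.879 to 50.198 contains the two lines "-Can I help you?" and "Yes." and is split at 49.9342, which is exactly the word-count-proportional boundary:

# reproducing SRTParser's proportional split of a two-line subtitle block
start, end = 48.879, 50.198
lines = ['-Can I help you?', 'Yes.']
words = [len(line.split()) for line in lines]               # 4 words, then 1 word
print start + (end - start) * words[0] / float(sum(words))  # 49.9342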

In [5]:
from pyannote.parser.srt import SRTParser
parser = SRTParser(split=True, duration=True)
pathToSRTFile = dataset.path_to_subtitles(firstEpisode, language='en')
subtitles = parser.read(pathToSRTFile)

For illustration purposes, let us focus on the first minute of the episode.

In [6]:
subtitles = subtitles.crop(0, 60)
transcript = transcript.crop('C', 'T')

The transcript provides the exact transcription of the dialogues, along with the characters' names.
However, it does not contain any timing information (other than the obvious chronological order).

On the other hand, subtitles only provide a coarse dialogue transcription.
However, they do provide raw timing information.
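
To make this concrete, here is a hand-written illustration of what each source knows about the very first exchange (values taken from the outputs shown below):

# what the transcript knows: who says what, in what order (no timing)
transcript_entry = ('SHELDON', 'So if a photon is directed through a plane with two slits...')

# what the subtitles know: when a line is displayed (no speaker)
subtitle_entry = (1.239, 3.833, 'If a photon is directed thr...')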

In [8]:
subtitles
Out[8]:
[Rendered graph of the subtitles Transcription: timestamped nodes linked by edges carrying the subtitle text, e.g. 1.239 -> 3.833 "If a photon is directed thr...", 48.879 -> 49.9342 "-Can I help you?", 49.9342 -> 50.198 "Yes."]

Our objective here is to combine those two sources of information in order to get as close as possible to the actual reference annotation:

In [9]:
reference = dataset.get_resource('speaker', firstEpisode)
# 'speaker' resources contain plenty of labels (e.g. music, applause, silence, etc.)
# we are only going to use the ones starting with 'speech_'
# (e.g. speech_sheldon, speech_penny, speech_raj, etc.)
reference = reference.subset(set([label for label in reference.labels() if label.startswith('speech')]))
# next, for better display below, we change each 'speech_somebody' label into 'SOMEBODY'
# (e.g. speech_sheldon becomes SHELDON)
reference = reference.translate({label: label[7:].upper() for label in reference.labels()})
# now, we just keep the first minute for illustration purposes
reference = reference.crop(Segment(0, 60))
# this should display nicely in IPython Notebook
reference
Out[9]:

We are now going to use dynamic time warping to align the transcript with the subtitles based on the actual text.

In [10]:
from pyannote.features.text.preprocessing import TextPreProcessing
from pyannote.features.text.tfidf import TFIDF

# this is the default text (i.e. subtitles or speech transcript) pre-processing 
preprocessing = TextPreProcessing(
    tokenize=True,    # step 1: tokenization of sentences into words
    lemmatize=True,   # step 2: lemmatization
    stem=True,        # step 3: stemming
    stopwords=True,   # step 4: remove stop-words
    pos_tag=True,     # step 5: part-of-speech tagging...
    keep_pos=True,    #         ...only keep nouns, adjectives, verbs and adverbs
    min_length=2)     # step 6: remove stems shorter than 2 letters

# this is the TF-IDF transformer that will project 
# each pre-processed text into a fixed-size vector
tfidf = TFIDF(preprocessing=preprocessing,  # use just-defined pre-processing
              binary=True)                  # use binary term-frequency 
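
For readers more familiar with scikit-learn, a binary TF-IDF transformer behaves conceptually like TfidfVectorizer with binary=True. The snippet below is only an analogy, not what pyannote.features uses internally:

# scikit-learn analogue of a binary TF-IDF transformer (illustration only)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(binary=True)   # clip term frequencies to 0/1
X = vectorizer.fit_transform(['if a photon is directed through a plane',
                              'it will not go through both slits'])
print X.shape                               # one unit-norm row per line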

The actual alignment is done by an instance of TFIDFAlignment (more details on what it does below).

In [11]:
from pyannote.algorithms.alignment.transcription import TFIDFAlignment
aligner = TFIDFAlignment(tfidf, adapt=True)
merged = aligner(transcript, subtitles, vattribute='speech', hattribute='subtitle')

Bam! Merged!

In [12]:
merged
Out[12]:
[Rendered graph of the merged Transcription: the same timestamped nodes, with edges now carrying both the subtitle text and the aligned speech line with its speaker, e.g. 1.239 -> 13.4806 speech "So if a photon is directed ..." speaker SHELDON.]

We end up with a timestamped transcript which could, in turn, be used to train acoustic speaker identification models.
Let us quickly convert it into an Annotation object that can be easily visualized:

In [13]:
from pyannote.core import Annotation
hypothesis = Annotation()
for start_time, end_time, data in merged.edges_iter(data=True):
    if 'speaker' in data:
        hypothesis[Segment(start_time, end_time)] = data['speaker']
hypothesis
Out[13]:

See? We managed to obtain a speaker annotation that is not that far from the actual ground truth:

In [14]:
reference
Out[14]:

Under the hood (TF-IDF + cosine distance + dynamic time warping)

In [15]:
from pyannote.algorithms.alignment.dtw import DynamicTimeWarping

TF-IDF

Under the hood, each dialogue line is first turned into a TF-IDF vector.

In [16]:
text = 'If a photon is directed through a plane with two slits in it...'
preprocessing(text)
Out[16]:
['photon', 'direct', 'plane', 'slit']
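
For intuition, steps 1 to 4 (and step 6) can be approximated with plain NLTK. This is a rough sketch of the recipe, not pyannote's actual implementation; in particular, step 5's part-of-speech filtering is omitted here (which is why a number word like 'two' would survive):

# rough NLTK approximation of the pre-processing pipeline (illustration only)
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

tokens = nltk.word_tokenize(text.lower())                            # step 1
tokens = [t for t in tokens if t not in stopwords.words('english')]  # step 4
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]          # step 2
stems = [PorterStemmer().stem(l) for l in lemmas]                    # step 3
print [s for s in stems if s.isalpha() and len(s) >= 2]              # step 6
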
In [17]:
vector = tfidf.transform([text])[0,:].toarray().squeeze()
print vector
[ 0.          0.          0.          0.          0.          0.52200489
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.52200489  0.          0.47698103  0.          0.          0.
  0.          0.          0.          0.47698103  0.          0.          0.
  0.          0.          0.        ]

vector has non-zero values only for the dimensions corresponding to its own words.

In [18]:
for i, value in enumerate(vector):
    if value > 0:   # only print non-zero values
        print tfidf._cv.get_feature_names()[i], value
direct 0.522004885041
photon 0.522004885041
plane 0.476981026869
slit 0.476981026869
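
Note that vector has unit L2 norm: with binary term frequencies, its non-zero entries are just the IDF weights of the four stems, rescaled so that 2 x 0.522^2 + 2 x 0.477^2 ≈ 1. This is easy to check:

# sanity check: TF-IDF vectors are L2-normalized
print np.linalg.norm(vector)   # ~1.0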

Cosine distance

Next, the cosine distance between all pairs of transcript and subtitle lines is computed.
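
Because the vectors are unit-normalized, the cosine distance between two lines is simply one minus their dot product. A minimal sketch, reusing the tfidf transformer fitted above (as a side effect of cell In [11]'s adapt=True):

# cosine distance between two lines, by hand
u = tfidf.transform(['If a photon is directed through a plane'])[0, :].toarray().squeeze()
v = tfidf.transform(['it will not go through both slits'])[0, :].toarray().squeeze()
print 1.0 - np.dot(u, v)   # 1.0 if the lines share no stem, smaller as they overlap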

In [19]:
vsequence = aligner._get_sequence(transcript, 'speech')
hsequence = aligner._get_sequence(subtitles, 'subtitle')
distance = aligner.pairwise_distance(vsequence, hsequence)
im = imshow(distance, interpolation='nearest', aspect='auto'); xlabel('Subtitles'); ylabel('Transcript'); colorbar();

Dynamic time warping

Dynamic time warping ends the process by looking for the path with minimum overall cost (in white below).
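
For reference, here is a minimal numpy version of the classical DTW recurrence, cost[i, j] = distance[i, j] + min(cost[i-1, j], cost[i, j-1], cost[i-1, j-1]). pyannote's DynamicTimeWarping is more general, but follows the same idea:

# minimal DTW on a precomputed distance matrix (illustration only)
def minimal_dtw(d):
    n, m = d.shape
    cost = np.empty((n, m))
    cost[0, 0] = d[0, 0]
    for i in range(1, n):                        # first column
        cost[i, 0] = d[i, 0] + cost[i - 1, 0]
    for j in range(1, m):                        # first row
        cost[0, j] = d[0, j] + cost[0, j - 1]
    for i in range(1, n):
        for j in range(1, m):
            cost[i, j] = d[i, j] + min(cost[i - 1, j],       # vertical step
                                       cost[i, j - 1],       # horizontal step
                                       cost[i - 1, j - 1])   # diagonal step
    # backtrack the minimum-cost path from (n-1, m-1) down to (0, 0)
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
        i, j = min([s for s in steps if s[0] >= 0 and s[1] >= 0],
                   key=lambda s: cost[s])
        path.append((i, j))
    return path[::-1]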

In [20]:
dtw = DynamicTimeWarping()
path = dtw(range(len(vsequence)), range(len(hsequence)), distance=distance)
In [21]:
for i, j in path:
    distance[i, j] = np.NaN
imshow(distance, interpolation='nearest', aspect='auto'); xlabel('Subtitles'); ylabel('Transcript'); colorbar();
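
The path pairs each transcript line index with one or more subtitle line indices; propagating the subtitle timestamps to the matched transcript lines along those pairs is, presumably, how TFIDFAlignment produced the timestamped transcript back in cell In [11]:

# inspect the first few aligned (transcript, subtitle) index pairs
for i, j in path[:5]:
    print 'transcript line %d <--> subtitle line %d' % (i, j)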