The MIT License (MIT)
Copyright (c) 2014 CNRS
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
AUTHORS
Hervé Bredin -- http://herve.niderb.fr
# setting up IPython Notebook so that it displays nicely
# you can safely skip this cell...
%pylab inline
pylab.rcParams['figure.figsize'] = (10.0, 5.0)
from pyannote.core.notebook import set_notebook_crop, Segment
set_notebook_crop(Segment(0, 60))
Start by loading The Big Bang Theory subset of the TVD dataset.
from tvd import TheBigBangTheory
dataset = TheBigBangTheory('/Volumes/data/tvd/')
For illustration purposes, we will focus only on the first episode of the TV series.
firstEpisode = dataset.episodes[0]
print firstEpisode
By chance, for this episode, both the English transcript and the subtitles are available in the dataset.
transcript = dataset.get_resource('transcript', firstEpisode)
pyannote.parser provides a subtitle file (.srt) parser.
It also takes care of dividing (split=True) subtitles that cover multiple dialogue lines from several speakers.
Additionally (duration=True), it allots to each line a duration proportional to its number of words.
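To get a feel for the duration=True behaviour, here is a toy sketch (made-up subtitle text and timings, not the parser's actual implementation) that splits a subtitle time span proportionally to word counts:
# toy illustration of duration=True (not the actual parser code):
# a subtitle covering two dialogue lines gets its time span divided
# proportionally to the number of words in each line
lines = ['So, if a photon is directed through a plane with two slits in it',
         'and either slit is observed...']
start, end = 11.0, 16.0  # made-up subtitle time span, in seconds
words = [len(line.split()) for line in lines]
t = start
for line, n in zip(lines, words):
    duration = (end - start) * n / float(sum(words))
    print '%5.2f --> %5.2f  %s' % (t, t + duration, line)
    t = t + duration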
from pyannote.parser.srt import SRTParser
parser = SRTParser(split=True, duration=True)
pathToSRTFile = dataset.path_to_subtitles(firstEpisode, language='en')
subtitles = parser.read(pathToSRTFile)
For illustration purposes, let us focus on the first minute of the episode.
subtitles = subtitles.crop(0, 60)
transcript = transcript.crop('C', 'T')
The transcript provides the exact transcription of the dialogues and characters' names.
However, it does not contain any timing information (other than the obvious chronological order).
transcript
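If you are curious about the underlying data structure, a rough peek (assuming the 'speech' and 'speaker' edge attributes used later in this notebook) could look like this:
# rough peek at the underlying structure: the graph encodes ordering only, no timestamps
# ('speech' and 'speaker' are the edge attributes used elsewhere in this notebook)
for t1, t2, data in transcript.edges_iter(data=True):
    if 'speech' in data:
        print t1, t2, data.get('speaker'), data['speech']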
On the other hand, subtitles only provide a coarse dialogue transcription.
However, they do provide raw timing information.
subtitles
Our objective here is to combine those two sources of information in order to get as close as possible to the actual reference annotation:
reference = dataset.get_resource('speaker', firstEpisode)
# the 'speaker' resource contains plenty of labels (e.g. music, applause, silence, etc.)
# we are only going to use the ones starting with 'speech_'
# (e.g. speech_sheldon, speech_penny, speech_raj, etc.)
reference = reference.subset(set([label for label in reference.labels() if label.startswith('speech')]))
# next, for better display below, we change each 'speech_somebody' label into 'SOMEBODY'
# (e.g. speech_sheldon becomes SHELDON)
reference = reference.translate({label: label[7:].upper() for label in reference.labels()})
# now, we just keep the first minute for illustration purposes
reference = reference.crop(Segment(0,60))
# this should display nicely in IPython Notebook
reference
We are now going to use dynamic time warping to align the transcript with the subtitles based on the actual text.
from pyannote.features.text.preprocessing import TextPreProcessing
from pyannote.features.text.tfidf import TFIDF
# this is the default text (i.e. subtitles or speech transcript) pre-processing
preprocessing = TextPreProcessing(
    tokenize=True,   # step 1: tokenization of sentences into words
    lemmatize=True,  # step 2: lemmatization
    stem=True,       # step 3: stemming
    stopwords=True,  # step 4: remove stop-words
    pos_tag=True,    # step 5: part-of-speech tagging...
    keep_pos=True,   # ...to keep only nouns, adjectives, verbs and adverbs
    min_length=2)    # step 6: remove stems shorter than 2 letters
# this is the TF-IDF transformer that will project
# each pre-processed text into a fixed-size vector
tfidf = TFIDF(preprocessing=preprocessing,  # use just-defined pre-processing
              binary=True)                  # use binary term-frequency
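For intuition, binary term-frequency means each term counts at most once per line. A rough scikit-learn analogue (not the pyannote class used here, and without the pre-processing chain) would be:
# rough scikit-learn analogue of binary TF-IDF, for intuition only
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(binary=True)  # term counts clipped to 0/1 before IDF weighting
X = vectorizer.fit_transform(['photon directed through a plane',
                              'two slits observed'])
print X.shape  # (number of lines, vocabulary size)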
The actual alignment is done by an instance of TFIDFAlignment (more details on what it does below).
from pyannote.algorithms.alignment.transcription import TFIDFAlignment
aligner = TFIDFAlignment(tfidf, adapt=True)
merged = aligner(transcript, subtitles, vattribute='speech', hattribute='subtitle')
Bam! Merged!
merged
We end up with a timestamped transcript which could, in turn, be used to train acoustic speaker identification models.
Let us quickly convert it into an Annotation object that can easily be visualized:
from pyannote.core import Annotation
hypothesis = Annotation()
for start_time, end_time, data in merged.edges_iter(data=True):
    if 'speaker' in data:
        hypothesis[Segment(start_time, end_time)] = data['speaker']
hypothesis
See? We managed to obtain a speaker annotation that is not that far from the actual ground truth:
reference
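"Not that far" can be quantified: pyannote also provides evaluation metrics, so one could, for instance, compute a diarization error rate between hypothesis and reference (a sketch; the exact module path may vary across pyannote versions):
# quantify the difference between hypothesis and reference
# (module path may differ depending on the installed pyannote version)
from pyannote.metrics.diarization import DiarizationErrorRate
der = DiarizationErrorRate()
print der(reference, hypothesis)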
from pyannote.algorithms.alignment.dtw import DynamicTimeWarping
Under the hood, each dialogue line is first turned into a TF-IDF vector.
text = 'If a photon is directed through a plane with two slits in it...'
preprocessing(text)
vector = tfidf.transform([text])[0,:].toarray().squeeze()
print vector
vector has non-zero values only for the dimensions corresponding to its words.
for i, value in enumerate(vector):
    if value > 0:  # only print non-zero values
        print tfidf._cv.get_feature_names()[i], value
Next, cosine distance between all pairs of transcript and subtitle lines is computed.
vsequence = aligner._get_sequence(transcript, 'speech')
hsequence = aligner._get_sequence(subtitles, 'subtitle')
distance = aligner.pairwise_distance(vsequence, hsequence)
im = imshow(distance, interpolation='nearest', aspect='auto'); xlabel('Subtitles'); ylabel('Transcript'); colorbar();
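The pairwise distance computation itself is standard; here is a self-contained sketch with random vectors standing in for the actual TF-IDF vectors:
# self-contained sketch of the pairwise cosine-distance step
import numpy as np
from scipy.spatial.distance import cdist
V = np.random.rand(5, 20)   # 5 "transcript" vectors of dimension 20
H = np.random.rand(8, 20)   # 8 "subtitle" vectors of dimension 20
D = cdist(V, H, metric='cosine')
print D.shape               # (5, 8): one distance per (transcript, subtitle) pair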
Dynamic time warping ends the process by looking for the path with minimum overall cost (in white below).
dtw = DynamicTimeWarping()
path = dtw(range(len(vsequence)), range(len(hsequence)), distance=distance)
for i, j in path:
    distance[i, j] = np.NaN
imshow(distance, interpolation='nearest', aspect='auto'); xlabel('Subtitles'); ylabel('Transcript'); colorbar();
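For intuition about what DynamicTimeWarping does under the hood, here is a minimal (and naive) sketch of the cost accumulation on a fresh random matrix; the class used above additionally recovers the optimal path by backtracking:
# naive sketch of the DTW cost accumulation: each cell holds the cheapest
# cumulative cost of any monotonic path from the top-left corner to that cell
import numpy as np

def dtw_cumulative_cost(D):
    n, m = D.shape
    cost = np.zeros((n, m))
    cost[0, 0] = D[0, 0]
    for i in range(1, n):
        cost[i, 0] = cost[i - 1, 0] + D[i, 0]
    for j in range(1, m):
        cost[0, j] = cost[0, j - 1] + D[0, j]
    for i in range(1, n):
        for j in range(1, m):
            cost[i, j] = D[i, j] + min(cost[i - 1, j],      # vertical move
                                       cost[i, j - 1],      # horizontal move
                                       cost[i - 1, j - 1])  # diagonal move
    return cost

toy = np.random.rand(4, 6)
print dtw_cumulative_cost(toy)[-1, -1]  # overall cost of the best alignment path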