Automatically synchronize subtitles with audio using machine learning

oseiskar/autosubsync

Automatic subtitle synchronization tool

Did you know that hundreds of movies, especially from the 1950s and '60s, are now in the public domain and available online? Great! Let's download Plan 9 from Outer Space. As a non-native English speaker, I prefer watching movies with subtitles, which can also be found online for free. However, sometimes there is a problem: the subtitles are not in sync with the movie.

But fear not. This tool can resynchronize the subtitles without any human input. A correction for both shift and playing speed can be found automatically... using "AI & machine learning".

Installation

macOS / OSX

Prerequisites: install Homebrew and pip. Then install FFmpeg and this package:

brew install ffmpeg
pip install autosubsync

Linux (Debian & Ubuntu)

Make sure you have pip, e.g., via sudo apt-get install python-pip. Then install FFmpeg and this package:

sudo apt install ffmpeg
sudo apt install libsndfile1 # sometimes optional
sudo pip install autosubsync

The libsndfile1 package is sometimes, but not always, needed because of bastibe/python-soundfile#258.

Usage

autosubsync [input movie] [input subtitles] [output subs]

# for example
autosubsync plan-9-from-outer-space.avi \
  plan-9-out-of-sync-subs.srt \
  plan-9-subtitles-synced.srt

See autosubsync --help for more details.

Features

  • Automatic speed and shift correction

  • Typical synchronization accuracy ~0.15 seconds (see performance)

  • Wide video format support through ffmpeg

  • Supports all reasonably encoded SRT files in any language

  • Should work with any language in the audio (only tested with a few though)

  • Quality-of-fit metric for checking sync success

  • Python API. Example (save as batch_sync.py):

    "Batch synchronize video files in a folder: python batch_sync.py /path/to/folder"
    
    import autosubsync
    import glob, os, sys
    
    if __name__ == '__main__':
        for video_file in glob.glob(os.path.join(sys.argv[1], '*.mp4')):
            base = video_file.rpartition('.')[0]
            srt_file = base + '.srt'
            synced_srt_file = base + '_synced.srt'
    
            # see help(autosubsync.synchronize) for more details
            autosubsync.synchronize(video_file, srt_file, synced_srt_file)

Development

Training the model

  1. Collect a bunch of well-synchronized video and subtitle files and put them in a file called training/sources.csv (see training/sources.csv.example)
  2. Run (and see) train_and_test.sh. This
    • populates the training/data folder
    • creates trained-model.bin
    • runs cross-validation

Synchronization (predict)

This assumes that a trained model is available as trained-model.bin:

python3 autosubsync/main.py input-video-file input-subs.srt synced-subs.srt

Build and distribution

  • Create virtualenv: python3 -m venv venvs/test-python3
  • Activate venv: source venvs/test-python3/bin/activate
  • pip install -e .
  • pip install wheel
  • python setup.py bdist_wheel

Methods

The basic idea is to first detect speech in the audio track, that is, for each point in time t in the film, to estimate whether speech is heard. The method described below produces this estimate as a probability of speech p(t). Another input to the program is the unsynchronized subtitle file containing the timestamps of the actual subtitle intervals.

Synchronization is done by finding a time transformation f(t) that makes s(f(t)), the synchronized subtitles, best match p(t), the detected speech. Here s(t) is the (unsynchronized) subtitle indicator function, whose value is 1 if any subtitles are visible at time t and 0 otherwise.
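
The indicator s(t) and its resynchronized version s(f(t)) can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, the subtitle intervals are assumed to be (start, end) pairs in seconds, and each interval is treated as half-open, [start, end).

```python
from bisect import bisect_right

def make_indicator(intervals):
    """Return s(t): 1 if any subtitle interval covers time t (seconds), else 0."""
    # Merge overlapping intervals so that a single lookup is enough.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [m[0] for m in merged]

    def s(t):
        k = bisect_right(starts, t) - 1  # last interval starting at or before t
        return 1 if k >= 0 and t < merged[k][1] else 0
    return s

def resynced_indicator(s, a, b):
    """Return t -> s(f(t)) for the linear transformation f(t) = a*t + b."""
    return lambda t: s(a * t + b)

subs = [(1.0, 3.0), (5.0, 6.5)]  # made-up subtitle intervals
s = make_indicator(subs)
print([s(t) for t in (0.5, 2.0, 4.0, 6.0)])  # [0, 1, 0, 1]

shifted = resynced_indicator(s, a=1.0, b=1.0)
print(shifted(0.5))  # s(1.5) = 1
```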

Speech detection (VAD)

Speech detection is done by first computing a spectrogram of the audio, that is, a matrix of features, where each column corresponds to a frame of duration Δt and each row to a certain frequency band. Additional features are engineered by computing a rolling maximum of the spectrogram with a few different periods.
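
The rolling-maximum feature step might look roughly like the sketch below. It assumes the spectrogram is given as a list of rows (one per frequency band) and uses a trailing window; the window alignment, the periods, and the function names are assumptions made for illustration, not details taken from the project.

```python
from collections import deque

def rolling_max(row, period):
    """Trailing rolling maximum over `period` frames (shorter window at the start)."""
    window = deque()  # frame indices whose values are in decreasing order
    out = []
    for i, value in enumerate(row):
        # Drop older frames that can never be the maximum again.
        while window and row[window[-1]] <= value:
            window.pop()
        window.append(i)
        # Drop the front index once it falls out of the window.
        if window[0] <= i - period:
            window.popleft()
        out.append(row[window[0]])
    return out

def add_rolling_max_features(spectrogram, periods=(2, 4)):
    """Stack the original rows with their rolling maxima for each period."""
    features = [list(row) for row in spectrogram]
    for period in periods:
        features.extend(rolling_max(row, period) for row in spectrogram)
    return features

# Toy spectrogram: 2 frequency bands x 6 frames
spec = [[1, 3, 2, 0, 0, 5],
        [0, 0, 4, 1, 1, 1]]
feats = add_rolling_max_features(spec, periods=(2, 3))
print(len(feats), "feature rows")  # 2 original rows + 2 rows per period
```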

Using a collection of correctly synchronized media files, one can create a training data set, where each feature column is associated with a correct label. This allows training a machine learning model to predict the labels, that is, detect speech, on any previously unseen audio track, as the probability of speech p(iΔt) on frame number i.

The weapon of choice in this project is logistic regression, a common baseline method in machine learning that is simple to implement. The accuracy of speech detection achieved with this model is not very good, only around 72% (AUROC). However, the speech detection results are not the final output of this program but just an input to the synchronization parameter search. As mentioned in the performance section, the overall synchronization accuracy is quite good even though the speech detection is not.
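
To make the idea concrete, here is a small, self-contained sketch of logistic regression trained with plain batch gradient descent on toy data. It is not the project's training code: the single "band energy" feature, the learning rate, the number of epochs, and the function names are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, bias, x):
    """Probability of speech for one feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + bias)

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            err = predict_proba(w, bias, x) - label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        bias -= lr * grad_b / n
    return w, bias

# Toy frames: one feature per frame, label 1 = a subtitle is visible
X = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
w, bias = train_logistic_regression(X, y)
print(predict_proba(w, bias, [0.95]), predict_proba(w, bias, [0.15]))
```

The output of predict_proba plays the role of p(iΔt) for a single frame; in the real setting each x would be one feature column of the spectrogram.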

Synchronization parameter search

This program only searches for linear transformations of the form f(t) = a t + b, where b is the shift and a is the speed correction. The optimization method is a brute-force grid search where b is limited to a certain range and a is one of the common skew factors. The parameters minimizing the loss function are selected.
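
A brute-force grid search of this kind can be sketched in a few lines. The shift range, step size, and toy loss below are placeholders chosen for the example (the real loss compares subtitles to detected speech), and the function names are hypothetical.

```python
from itertools import product

def grid_search(loss, speeds, shifts):
    """Return the (a, b) pair with the smallest loss, together with that loss."""
    best = min(product(speeds, shifts), key=lambda ab: loss(*ab))
    return best, loss(*best)

def shift_grid(max_shift, step):
    """Evenly spaced shifts from -max_shift to +max_shift (seconds)."""
    n = int(round(max_shift / step))
    return [k * step for k in range(-n, n + 1)]

# Toy loss with a known minimum at a = 25/24, b = -2.0
def toy_loss(a, b):
    return abs(a - 25 / 24) * 1000 + abs(b + 2.0)

speeds = [1.0, 1001 / 1000, 25 / 24, 24 / 25]
(best_a, best_b), best_loss = grid_search(toy_loss, speeds, shift_grid(5.0, 0.5))
print(best_a, best_b, best_loss)
```

Because the set of speed candidates is tiny, the cost of the search is dominated by the number of shifts tried times the cost of one loss evaluation.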

Loss function

The data produced by the speech detection phase is a vector representing the speech probabilities in frames of duration Δt. The metric used for evaluating match quality is expected linear loss:

    loss(f) = Σ_i [ s(f_i) (1 - p_i) + (1 - s(f_i)) p_i ],

where p_i = p(iΔt) is the probability of speech and s(f_i) = s(f(iΔt)) = s(a iΔt + b) is the subtitle indicator resynchronized using the transformation f at frame number i.
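
As a worked example, the loss can be evaluated directly from its definition. The sketch below uses made-up speech probabilities, a single hypothetical subtitle, and 1-second frames; the function name is an assumption.

```python
def expected_linear_loss(p, s, a, b, dt):
    """Sum over frames i of s(f_i)(1 - p_i) + (1 - s(f_i)) p_i, with f_i = a*i*dt + b."""
    total = 0.0
    for i, p_i in enumerate(p):
        s_i = s(a * i * dt + b)
        total += s_i * (1.0 - p_i) + (1 - s_i) * p_i
    return total

# Hypothetical subtitle file: one subtitle visible from 2 s to 4 s
def s(t):
    return 1 if 2.0 <= t < 4.0 else 0

# Made-up speech probabilities per 1-second frame: speech during frames 1 and 2
p = [0.1, 0.9, 0.9, 0.1, 0.1]

late = expected_linear_loss(p, s, a=1.0, b=0.0, dt=1.0)   # subtitle one second late
fixed = expected_linear_loss(p, s, a=1.0, b=1.0, dt=1.0)  # shifted onto the speech
print(late, fixed)  # roughly 2.1 and 0.5
```

Each frame contributes the probability that the subtitle indicator disagrees with the detected speech, so the aligned shift b = 1 gives the smaller loss.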

Speed correction

Speed/skew detection is based on the assumption that an error in playing speed is not an arbitrary number but is caused by a frame rate mismatch, which constrains the possible playing speed multiplier to be a ratio of two common frame rates that is sufficiently close to one. In particular, it must be one of the following values

  • 24/23.976 = 30/29.97 = 60/59.94 = 1001/1000
  • 25/24
  • 25/23.976

or the reciprocal (1/x).
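
The candidate set can be enumerated with exact fractions, as in the sketch below. It treats 23.976 as the exact rate 24000/1001 (which is what makes 24/23.976 equal 1001/1000) and also includes 1, meaning no speed correction; that inclusion and the constant names are assumptions of this example.

```python
from fractions import Fraction

# Frame rates from the list above; 23.976 is written as its exact value 24000/1001.
RATE_24 = Fraction(24)
RATE_25 = Fraction(25)
RATE_23_976 = Fraction(24000, 1001)

def candidate_speeds():
    """All listed ratios, their reciprocals, and 1 (no correction), sorted."""
    base = {RATE_24 / RATE_23_976, RATE_25 / RATE_24, RATE_25 / RATE_23_976}
    return sorted({Fraction(1)} | base | {1 / x for x in base})

for a in candidate_speeds():
    print(a, float(a))
```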

The reasoning behind this is that if the frame rate of (digital) video footage needs to be changed and the target and source frame rates are close enough, the conversion is often done by skipping any re-sampling and just changing the nominal frame rate. This effectively changes the playing speed of the video and the pitch of the audio by a small factor which is the ratio of these frame rates.
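
Even a small speed factor adds up over the length of a film. The following arithmetic sketch (the function name and the 90-minute running time are just an example) shows how far subtitles drift if the speed error is left uncorrected:

```python
def drift_seconds(speed, elapsed_seconds):
    """Accumulated offset between subtitles and audio after `elapsed_seconds`."""
    return (speed - 1.0) * elapsed_seconds

ninety_minutes = 90 * 60
print(drift_seconds(25 / 24, ninety_minutes))      # about 225 seconds
print(drift_seconds(1001 / 1000, ninety_minutes))  # about 5.4 seconds
```

This is why a pure shift correction is not enough when the frame rates differ: the error is small at the start of the film and grows linearly with time.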

Performance

Based on somewhat limited testing, the typical shift error in auto-synchronization seems to be around 0.15 seconds (cross-validation RMSE) and generally below 0.5 seconds. In other words, it seems to work well enough in most cases but could be better. Speed correction errors did not occur.

Auto-syncing a full-length movie currently takes about 3 minutes and utilizes around 1.5 GB of RAM.

References

I first searched Google to see whether someone had already tried to solve the same problem and found this great blog post, whose author had implemented a solution using more or less the same approach that I had in mind. The post also included good points that I had not realized, such as using correctly synchronized subtitles as training data for speech detection.

Instead of starting from the code linked in that blog post, I decided to implement my own version from scratch, since this seemed like a good application for trying out RNNs. They turned out to be unnecessary, but this was a nice project nevertheless.

Other similar projects