
RFC: What should we do about overlapping subtitles? #60

Open
emk opened this issue May 4, 2024 · 1 comment
Comments

emk (Owner) commented May 4, 2024

The core substudy algorithms are all designed around non-overlapping subtitles. There's a built-in "cleaning" layer that fixes small overlaps as best it can. But a few SRT files use partially overlapping subs to convey semantic and timing information, and other SRT files contain lots of garbage data.
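
As a rough illustration of what such a "cleaning" layer might do (this is a hypothetical sketch, not substudy's actual code or API), small overlaps between consecutive subtitles can be fixed by trimming the end of the earlier one:

```rust
/// A subtitle's time span, in seconds. Mirrors the `Period` debug output
/// quoted below, but the field layout here is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Period {
    begin: f32,
    end: f32,
}

/// Trim small overlaps (up to `max_fix` seconds) between consecutive
/// periods by clamping the earlier period's end. Larger overlaps are
/// left alone for a later stage to complain about.
fn clean_small_overlaps(periods: &mut [Period], max_fix: f32) {
    for i in 1..periods.len() {
        let overlap = periods[i - 1].end - periods[i].begin;
        if overlap > 0.0 && overlap <= max_fix {
            periods[i - 1].end = periods[i].begin;
        }
    }
}

fn main() {
    let mut subs = vec![
        Period { begin: 1.0, end: 3.1 }, // overlaps the next sub by ~0.1s
        Period { begin: 3.0, end: 5.0 },
    ];
    clean_small_overlaps(&mut subs, 0.5);
    assert_eq!(subs[0].end, 3.0);
    println!("{:?}", subs);
}
```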

What should we do here? Major options include:

  1. Try a few simple things to produce non-overlapping subs, and if none of those work, try to issue a good error. This is the approach we took in #37 (Error: Cannot truncate time period Period { begin: 453.57, end: 457.84 } at 453.57). We could try to improve the "cleaning" algorithm to handle more cases, if we know what people are regularly encountering.
  2. Automatically combine subs with non-trivial overlap into one giant combined subtitle. This is tricky, especially with certain Whisper output, which will often produce a 30-second segment overlapping many shorter segments.
  3. Redesign all our algorithms and UI ideas to handle overlapping subtitles.
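
For concreteness, option (2) might look something like the sketch below (hypothetical names, not substudy's implementation): merge every run of transitively overlapping subtitles into one combined subtitle. The Whisper pathology mentioned above is visible here too, since a single 30-second segment swallows everything it overlaps:

```rust
/// A subtitle's time span, in seconds (illustrative type, not substudy's).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Period {
    begin: f32,
    end: f32,
}

/// Merge any run of overlapping periods into one combined period
/// spanning the whole run.
fn combine_overlapping(mut periods: Vec<Period>) -> Vec<Period> {
    periods.sort_by(|a, b| a.begin.partial_cmp(&b.begin).unwrap());
    let mut merged: Vec<Period> = Vec::new();
    for p in periods {
        match merged.last_mut() {
            // Overlap with the previous combined sub: extend it.
            Some(last) if p.begin < last.end => last.end = last.end.max(p.end),
            _ => merged.push(p),
        }
    }
    merged
}

fn main() {
    // A Whisper-style 30-second segment overlapping two shorter ones
    // collapses into one giant combined subtitle.
    let merged = combine_overlapping(vec![
        Period { begin: 0.0, end: 30.0 },
        Period { begin: 1.0, end: 3.0 },
        Period { begin: 4.0, end: 6.0 },
    ]);
    assert_eq!(merged, vec![Period { begin: 0.0, end: 30.0 }]);
}
```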

I am honestly not too interested in pursuing (3) if I can possibly get good results (for most use cases) without it. But (1) vs (2) is a harder tradeoff, and I'd love feedback on what people are encountering in their SRT files.

CC @aaron-meyers

@aaron-meyers

The main concern I would have with either 1 or 2 is that a lot of videos legitimately have overlapping subtitles, because there are multiple speakers simultaneously (e.g. a TV broadcaster in the background while another character is speaking). In some cases, the 'secondary' subtitle has some unique formatting that could be used to identify it and then treat it essentially as a separate track, but this would need to be detected per file (or by implementing a bunch of common patterns). For example, in Japanese, Netflix will generally display one subtitle on the bottom (like normal) and a secondary subtitle on the right (vertically). In English I've seen italic used for the secondary subtitle or even different colors (in .ass subtitles).
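
One of those "common patterns" could be checked with something as simple as this (a hypothetical sketch of the idea, covering only the fully-italicized case; real files would need more patterns, like `{\an8}` positioning or ASS color overrides):

```rust
/// Treat a subtitle as belonging to a "secondary" track if its whole
/// text is wrapped in italics — one of the common conventions mentioned
/// above. Purely illustrative; not substudy's actual detection logic.
fn is_secondary(text: &str) -> bool {
    let t = text.trim();
    t.starts_with("<i>") && t.ends_with("</i>")
}

fn main() {
    assert!(is_secondary("<i>TV: ...and in other news tonight.</i>"));
    assert!(!is_secondary("Did you hear that?"));
}
```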

I haven't looked at your alignment algorithm and I haven't actually tried to implement one myself yet. I was going to start with something pretty simple - iterating over the native (base) subtitle items and aligning the reference subtitles when they have > some % overlap with the native subtitle item (maybe 90%+) by default, with a more relaxed match if there aren't any overlapping subtitles in each track. This is probably naive though 😅

Projects: Triage
Development: no branches or pull requests
2 participants