Extracting Subtitles From Containers

The title sounds weird out of context, but for about a month or so I’ve been tinkering with my own media server. I had a 2011 PC I didn’t use anymore so I decided to repurpose it.

What I had

  • a nice 10-year-old rubber-padded case which got really sticky and nasty over time (gross)
  • LGA1155 budget motherboard from 2012-2013
  • Intel 2nd Gen i5-2500K processor, which is probably the only salvageable part
  • 2 sticks of Corsair RAM of which only one works, so 2 GB of DDR3 memory
  • 3 TB WD RED hard drive which I bought a couple of years back
  • 240 GB Intel SSD which I use for the operating system

Obviously this didn’t cut it at all, so I had to change things up a bit.

What I got

  • 8 GB of the cheapest DDR3 RAM I could find
  • a couple of 4TB HDDs (WD RED); one was faulty, got it replaced
  • a cool future-proof case with 8 3.5” HDD bays
  • a microATX motherboard, since I was dumb and didn’t check if the case actually supported ATX boards like my existing one (hint: it doesn’t)
  • a PCI SATA controller, as the number of SATA ports on the motherboard didn’t suffice

What I did

After the gruesome[1] process of assembling everything, I remembered I didn’t actually have a monitor at my old place. However, I thought, this wasn’t my first rodeo with headless OS installs (thinks about Raspberry Pi). Since I just wanted a plain Debian installation, I figured this would be just as simple.

It wasn’t. I spent hours searching for a simple way, convinced that I didn’t want to read hundreds of pages of Debian manuals and documentation to learn about preseeding and related concepts. Finally, I found the Holy Grail: Ciborski’s guide to remote Debian installation over SSH. Basically this Polish guy went through all this shit so I didn’t have to. I flashed the ISO to a USB flash drive, plugged it in, and it was smooth sailing from then on.

After installing Debian and OpenMediaVault (media-server application/overlay with a nice web interface), I began preparing my filesystems. I pooled my new 4TB drives and my old 3TB drive together with mergerfs, so I’d have a huge amount of space under a single apparent mountpoint. As the number of hard drives is limited, I decided not to opt for parity right now. Should I change my mind, probably once I add more HDDs, I would use SnapRAID.

OK, enough about this. How do I extract my damn subtitles?

This should be enough for context. Now, to get to the problem at hand.

To serve media from the filesystem I use Plex Media Server, which I run inside a Docker container. It’s not open source, but it’s the best supported media server software on the market, so let’s leave it at that.

I pride myself on my English proficiency and can decently understand the weirdest of accents, but I still like to watch my movies with English subtitles. Plex likes text-format subtitles best, mainly SRT files kept separate from the media file.

However, during the process of legally acquiring films to watch, you might run into more particular kinds of subtitles. When you rip your DVDs or Blu-rays, you end up with SUB/IDX file pairs or SUP files. These are actually image representations of the subtitles, so they have to be OCR-ed to get a compliant SRT file.

Case Study 1: ASS to SRT

Dependencies: mkvtoolnix, ffmpeg

ASS stands for Advanced SubStation Alpha, the successor to SSA (SubStation Alpha). It is a subtitle format generally used by fansubbers, so you might run into it when watching anime.

This is actually a text format so converting it to SRT is trivial. The only issue was that the ASS subtitle track was embedded into the MKV, so it had to be extracted first.

  • Check MKV info and locate subtitle track:
$ mkvmerge -i Cool_Anime_Episode_1.mkv
File 'Cool_Anime_Episode_1.mkv': container: Matroska
Track ID 0: video (MPEG-4p10/AVC/h.264)
Track ID 1: audio (MP3)
Track ID 2: audio (AAC)
Track ID 3: audio (AAC)
Track ID 4: subtitles (SubStationAlpha)
Attachment ID 1: type 'application/vnd.ms-opentype', size 14412 bytes, file name 'NeoSans.otf'
Attachment ID 2: type 'application/vnd.ms-opentype', size 174244 bytes, file name 'rough_typewriter.otf'
Chapters: 6 entries
  • Extract track from MKV:
$ mkvextract tracks Cool_Anime_Episode_1.mkv 4:Cool_Anime_Episode_1.ass
Extracting track 4 with the CodecID 'S_TEXT/ASS' to the file 'Cool_Anime_Episode_1.ass'. Container format: SSA/ASS text subtitles
Progress: 100%
  • Convert extracted subtitle to SRT:
$ ffmpeg -i Cool_Anime_Episode_1.ass -codec:s text Cool_Anime_Episode_1.srt
  • OK, cool, but Cool Anime has like 30 episodes, do I have to do this manually for each episode?
# batch extract
$ for file in `find . -name '*.mkv'`; do mkvextract tracks $file 4:${file::-4}.ass; done

# batch convert
$ for file in `find . -name '*.ass'`; do ffmpeg -i $file ${file::-4}.srt; done
  • OK, cool, but my filenames contain spaces…
# batch extract (the subshell keeps the IFS change out of your session)
$ (IFS=$'\n'; for file in $(find . -name '*.mkv'); do mkvextract tracks "$file" 4:"${file::-4}.ass"; done)

# batch convert
$ (IFS=$'\n'; for file in $(find . -name '*.ass'); do ffmpeg -i "$file" "${file::-4}.srt"; done)
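
The loops above still hard-code track ID 4, which only holds if every episode has the same track layout. Here is a rough sketch of a small script that looks the track up per file instead. It assumes `mkvmerge -i` prints English lines in the same format as the listing above; the function names and the script name `ass2srt.sh` are made up for illustration.

```shell
#!/bin/sh
# Sketch: extract the first ASS/SSA track of each MKV given as an argument
# and convert it to SRT. Function names are hypothetical.

# Read `mkvmerge -i` output on stdin and print the ID of the first
# SubStationAlpha track, or nothing if there is none.
first_ass_track() {
  sed -n 's/^Track ID \([0-9][0-9]*\): subtitles (SubStationAlpha).*/\1/p' | head -n 1
}

# Convert one file: Foo.mkv -> Foo.ass -> Foo.srt
ass_to_srt() {
  base=${1%.mkv}
  id=$(mkvmerge -i "$1" | first_ass_track)
  if [ -z "$id" ]; then
    echo "no ASS track in $1, skipping" >&2
    return 1
  fi
  # Only convert if the extraction succeeded.
  mkvextract tracks "$1" "$id:$base.ass" &&
    ffmpeg -i "$base.ass" "$base.srt"
}

# Process every file named on the command line.
for file in "$@"; do
  ass_to_srt "$file"
done
```

Saved as `ass2srt.sh`, it could be run with `find . -name '*.mkv' -exec sh ass2srt.sh {} +`. Since find hands each filename over as a separate argument, spaces in names need no IFS tricks.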

Case Study 2: SUB/IDX to SRT

Dependencies: vobsub2srt

This is what you get after ripping DVDs. SUB files generally contain image representations of each subtitle, and the IDX file controls timestamps. For this case the solution is even easier.

Suppose your structure is similar to this:

/movies/The.Guest.1963.1080p.BluRay.x264-GHOULS/Subs$ ls
the.guest.1963.1080p.bluray.x264-ghouls.idx  the.guest.1963.1080p.bluray.x264-ghouls.sub

You need to feed vobsub2srt the filename (without extension):

/movies/The.Guest.1963.1080p.BluRay.x264-GHOULS/Subs$ vobsub2srt the.guest.1963.1080p.bluray.x264-ghouls
Wrote Subtitles to 'the.guest.1963.1080p.bluray.x264-ghouls.srt'

Since this is an image format, the OCR might crap out and give you errors like | or l (lowercase L) instead of I (uppercase i). That’s where the --blacklist option comes in: it lists characters the OCR should never output (e.g. vobsub2srt --blacklist "\/|<>" filename).
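
If you have a pile of SUB/IDX pairs, the same call can be wrapped in a few lines of shell. This is only a sketch: `vobsub_one` and the script name `idx2srt.sh` are hypothetical, and the blacklist is simply the one from the example above, so adjust it to whatever errors you actually see.

```shell
#!/bin/sh
# Sketch: OCR every SUB/IDX pair, given by its .idx file.
# `vobsub_one` is a hypothetical helper name, not part of vobsub2srt.
vobsub_one() {
  base=${1%.idx}      # vobsub2srt wants the name without the extension
  if [ ! -f "$base.sub" ]; then
    echo "missing $base.sub, skipping" >&2
    return 1
  fi
  # Same blacklist as in the example above; tune it to the errors you see.
  vobsub2srt --blacklist '\/|<>' "$base"
}

# Process every .idx file named on the command line.
for idx in "$@"; do
  vobsub_one "$idx"
done
```

Saved as `idx2srt.sh`, something like `find /movies -name '*.idx' -exec sh idx2srt.sh {} +` would walk the whole library.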

Case Study 3: PGS/SUP to SRT

Dependencies: mkvtoolnix, bdsup2sub, vobsub2srt

This is another image format, commonly seen in Blu-ray rips. Once again, we scan for tracks in the MKV:

$ mkvmerge -i The.Rocky.Horror.Picture.Show.1975.1792x1080.BDRip.x264.DTS-HD.MA.Eng.mkv
File 'The.Rocky.Horror.Picture.Show.1975.1792x1080.BDRip.x264.DTS-HD.MA.Eng.mkv': container: Matroska
Track ID 0: video (MPEG-4p10/AVC/h.264)
Track ID 1: audio (DTS-HD Master Audio)
Track ID 2: subtitles (HDMV PGS)

Extract the required track:

$ mkvextract tracks The.Rocky.Horror.Picture.Show.1975.1792x1080.BDRip.x264.DTS-HD.MA.Eng.mkv 2:The.Rocky.Horror.Picture.Show.1975.1792x1080.BDRip.x264.DTS-HD.MA.Eng.sup
Extracting track 2 with the CodecID 'S_HDMV/PGS' to the file 'The.Rocky.Horror.Picture.Show.1975.1792x1080.BDRip.x264.DTS-HD.MA.Eng.sup'. Container format: SUP
Progress: 100%

After that, you run bdsup2sub on the extracted file to get IDX/SUB files:

$ bdsup2sub -o The.Rocky.Horror.Picture.Show.1975.1792x1080.BDRip.x264.DTS-HD.MA.Eng.sub -i The.Rocky.Horror.Picture.Show.1975.1792x1080.BDRip.x264.DTS-HD.MA.Eng.sup

Finally, vobsub2srt takes the SUB/IDX files and turns them into the final SRT.

$ vobsub2srt The.Rocky.Horror.Picture.Show.1975.1792x1080.BDRip.x264.DTS-HD.MA.Eng
Wrote Subtitles to 'The.Rocky.Horror.Picture.Show.1975.1792x1080.BDRip.x264.DTS-HD.MA.Eng.srt'
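
The three steps chain together naturally, so here is a minimal sketch of a wrapper. `pgs_to_srt` is a hypothetical name, the track ID has to be supplied by you (read it off the `mkvmerge -i` listing), and the bdsup2sub invocation is copied as-is from the command above, so it assumes your bdsup2sub build accepts the same flags.

```shell
#!/bin/sh
# Sketch: MKV with a PGS track -> SUP -> SUB/IDX -> SRT.
# `pgs_to_srt` is a hypothetical helper name.
pgs_to_srt() {
  if [ "$#" -ne 2 ]; then
    echo "usage: pgs_to_srt FILE.mkv TRACK_ID" >&2
    return 2
  fi
  base=${1%.mkv}
  # Each step only runs if the previous one succeeded.
  # The bdsup2sub flags mirror the command shown in this post.
  mkvextract tracks "$1" "$2:$base.sup" &&
    bdsup2sub -o "$base.sub" -i "$base.sup" &&
    vobsub2srt "$base"
}
```

After sourcing the file, a call would look like `pgs_to_srt Movie.mkv 2`, where `Movie.mkv` stands in for your own file.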

Conclusions

This is it, basically. Since these are all command-line tools, they can be easily scripted and automated, whether you are adding new files or have a boatload of existing files that all need the same steps.
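
As a starting point for that kind of automation, here is a small sketch that reads `mkvmerge -i` output and says which of the three routes each subtitle track needs. The `SubStationAlpha` and `HDMV PGS` labels come from the listings in this post; the `VobSub` label is my assumption about how mkvmerge names embedded DVD subtitles, and `classify_subs` is a made-up name.

```shell
#!/bin/sh
# Sketch: print "<track id> <kind> <tools>" for each subtitle track.
# Reads `mkvmerge -i` output on stdin; `classify_subs` is a hypothetical name.
classify_subs() {
  # Keep only lines like "Track ID 4: subtitles (SubStationAlpha)"
  # and reduce them to "4 SubStationAlpha".
  sed -n 's/^Track ID \([0-9][0-9]*\): subtitles (\(.*\))$/\1 \2/p' |
  while read -r id codec; do
    case $codec in
      SubStationAlpha*) echo "$id text ffmpeg" ;;                  # Case Study 1
      'HDMV PGS'*)      echo "$id image bdsup2sub+vobsub2srt" ;;   # Case Study 3
      VobSub*)          echo "$id image vobsub2srt" ;;             # assumed label
      *)                echo "$id unknown -" ;;
    esac
  done
}
```

Usage would be along the lines of `mkvmerge -i Movie.mkv | classify_subs`, with the output feeding whatever extraction script you prefer.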


  1. Multiple iterations were required. First I got the case and tried to mash my existing motherboard into it. It didn’t fit. For some reason I kept thinking my motherboard model was GA-B75M-D3V. It wasn’t. My model was GA-B75-D3V, without the M, which stood for micro, which is the type of motherboard that fits in the case. So I wasn’t wrong about everything. One week later, when I got a new and compatible motherboard, one of the freshly bought HDDs failed. So let’s say this wasn’t a smooth process, at all.