Subtitle Encoding: Handling Character Sets Right

Posted on December 4, 2024 by SubZap · 7 min read

  • 🔧 Technical
  • 🔍 Debugging
  • 🌐 Web

Ever seen subtitles turn into gibberish like this?

Original:       Hello, 你好, Привет
Wrong encoding: Hello, ä½ å¥½, ÐŸÑ€Ð¸Ð²ÐµÑ‚

This is an encoding problem - and the tricky part is that most subtitle editors won't even show you what's wrong. While they're great for timing and formatting, they often hide or mishandle encoding issues. Sometimes you need to break out the developer tools.

The Encoding Problem

Text files aren't just text - they're sequences of bytes that need to be interpreted correctly. Different encoding systems map these bytes to different characters. When a file is read with the wrong encoding, you get mojibake - that garbled text you saw above.

Here's what happens behind the scenes:

Text: 你
UTF-8 bytes: E4 BD A0
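
Read back with the wrong decoder, those same three bytes turn into "ä½" plus a stray non-breaking space, and the gibberish begins. You can reproduce the whole round trip yourself; here's a minimal Python sketch, purely for illustration:

# Encode as UTF-8, then decode the same bytes as Windows-1252 to reproduce mojibake
text = "你"
utf8_bytes = text.encode("utf-8")            # b'\xe4\xbd\xa0'  ->  E4 BD A0
print(utf8_bytes.hex(" ").upper())           # prints: E4 BD A0

garbled = utf8_bytes.decode("windows-1252")  # the wrong decoder
print(garbled)                               # prints: ä½ followed by a non-breaking space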

This is particularly common with subtitles because they:

  • Often contain multiple languages
  • Get shared across different platforms
  • Come from various editing tools
  • Might be very old files

Common Encoding Issues

Mixed Encodings

Sometimes a single file contains multiple encodings. This usually happens when:

  • Copying text from different sources
  • Editing with different tools
  • Converting files incorrectly

Example of mixed encoding:

1
00:00:01,000 --> 00:00:04,000
This is fine        <-- plain ASCII, looks the same in every encoding
你好                <-- UTF-8 bytes; if another cue was saved as Windows-1252, the file now mixes encodings
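
A quick way to hunt down lines like this from the developer-tools side is to read the file as raw bytes and test each line against UTF-8. A minimal Python sketch (the filename is just a placeholder):

# Flag lines that are not valid UTF-8 (a telltale sign of mixed encodings)
path = "subtitle.srt"  # placeholder filename

with open(path, "rb") as f:                  # raw bytes, no decoding yet
    for number, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            print(f"Line {number} is not valid UTF-8: {raw!r}")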

Byte Order Marks (BOM)

The BOM is a special marker at the start of a file that indicates its encoding. Some systems require it, others reject it:

  • Windows Notepad expects it
  • Many Unix tools reject it
  • Some players ignore it entirely
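
For UTF-8 the BOM is the three bytes EF BB BF at the very start of the file. If you want to check for it or strip it yourself, a small Python sketch like this does the job (the filename is a placeholder):

# Detect and remove a UTF-8 BOM (the bytes EF BB BF at the start of the file)
path = "subtitle.srt"  # placeholder filename

with open(path, "rb") as f:
    data = f.read()

if data.startswith(b"\xef\xbb\xbf"):
    print("UTF-8 BOM found, removing it")
    with open(path, "wb") as f:
        f.write(data[3:])                    # rewrite the file without the BOM
else:
    print("No UTF-8 BOM")

When reading files in Python, the built-in "utf-8-sig" codec accepts input with or without a BOM, which makes it a forgiving default.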

Platform Assumptions

Different systems make different assumptions:

  • Windows often defaults to Windows-1252
  • macOS typically assumes UTF-8
  • Older systems might use ISO-8859-1
  • Web platforms usually expect UTF-8

Working with Encodings

Detecting the Current Encoding

Most code editors can detect the current encoding for you, and the command line can too:

  • Notepad++: "Encoding" menu shows current encoding
  • VSCode: Bottom right corner shows encoding
  • Command line: file -i filename.srt
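
If you'd rather script the check, a simple approach is to try a few likely encodings and see which one decodes without errors. A rough Python sketch (the candidate list and filename are assumptions; adjust them for your files):

# Try candidate encodings in order and report the first one that decodes cleanly.
# ISO-8859-1 accepts every byte value, so keep it last as a catch-all.
path = "subtitle.srt"  # placeholder filename
candidates = ["utf-8-sig", "utf-8", "windows-1252", "iso-8859-1"]

with open(path, "rb") as f:
    data = f.read()

for encoding in candidates:
    try:
        data.decode(encoding)
        print(f"Decodes cleanly as {encoding}")
        break
    except UnicodeDecodeError:
        print(f"Not valid {encoding}")

Keep in mind that decoding cleanly is a strong hint, not proof, so eyeball the result (especially accented characters) before trusting it.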

Converting Between Encodings

Always make a backup before converting a file. Then, to convert:

Using a text editor, like Notepad++:

  1. Encoding -> Convert to UTF-8
  2. Save the file

Using the command line (iconv):

iconv -f ISO-8859-1 -t UTF-8 input.srt > output.srt
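
If you have a whole batch of files, or want the backup step built in, here's a minimal Python sketch (the source encoding and filename are assumptions; detect the real encoding first):

# Convert one subtitle file to UTF-8, keeping a backup of the original
import shutil

path = "input.srt"              # placeholder filename
source_encoding = "iso-8859-1"  # assumption: detect the real encoding first

shutil.copy2(path, path + ".bak")   # backup before touching anything

# newline="" keeps the original line endings (CRLF or LF) untouched
with open(path, "r", encoding=source_encoding, newline="") as f:
    text = f.read()

with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(text)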

Testing Across Platforms

Always test converted files in more than one player and verify that every special (non-ASCII) character displays correctly. When a file is headed for a specific platform, test there as well to ensure maximum compatibility.

Why UTF-8 Is the Answer

UTF-8 has become the standard for good reasons:

  • Supports all Unicode characters (including emoji 😊)
  • Backward compatible with ASCII
  • Efficient storage: ASCII characters still use just one byte each
  • Default in modern systems with no platform-specific quirks
  • Web-friendly

We suggest using UTF-8 for all new subtitle files, and converting legacy files to UTF-8 when needed. This will also increase compatibility with the subtitling tools we provide here at SubZap.
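
The storage point is easy to verify: ASCII stays at one byte per character in UTF-8, and other scripts only use more bytes where they need them. A quick check in Python:

# ASCII stays one byte per character in UTF-8; other scripts take more
for text in ["Hello", "你好", "Привет", "😊"]:
    encoded = text.encode("utf-8")
    print(f"{text!r}: {len(text)} character(s), {len(encoded)} bytes")

# 'Hello': 5 character(s), 5 bytes
# '你好': 2 character(s), 6 bytes
# 'Привет': 6 character(s), 12 bytes
# '😊': 1 character(s), 4 bytes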

Tools and Commands

While subtitle editors are great for most tasks, encoding issues often require different tools:

Code Editors

Notepad++

Notepad++ shows the current encoding in the status bar, detects it automatically when a file is opened, and can convert between encodings from the Encoding menu.

VSCode

VSCode is hugely popular and handles encodings well: the status bar shows the current encoding, and clicking it lets you reopen or save the file with a different one. It can also guess the encoding from file contents (enable the files.autoGuessEncoding setting), and Microsoft's Hex Editor extension adds a hex view for byte-level inspection.

Sublime Text

Sublime Text is an older editor with solid encoding support, including hex viewing and batch processing across multiple files.

Command Line Tools

# Detect encoding
file -i subtitle.srt

# Convert to UTF-8
iconv -f ISO-8859-1 -t UTF-8 input.srt > output.srt

# Check for encoding issues (iconv exits with an error if the file is not valid UTF-8)
iconv -f UTF-8 -t UTF-8 subtitle.srt > /dev/null

Validation Tools

  • SubtitleEdit: Has encoding detection
  • ffmpeg: Can check subtitle encoding
  • Online validators: Various web tools

Best Practices

  • When creating new files: Set your editor's default to UTF-8, save files as UTF-8, and verify the encoding after saving.
  • When converting legacy files: Make a backup first, then test after conversion to confirm that every non-ASCII character still displays correctly, that each file really is UTF-8, and that no mixed encodings remain.
  • When validating files: Use more than one tool, test with the target players, check all special characters, and verify line endings (a small script can cover most of this, see the sketch after this list).
  • When working with multiple platforms: Test on the target platforms (web, mobile) when needed to ensure maximum compatibility.
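
A small script can cover most of that validation checklist for a whole folder at once. A rough Python sketch (the folder path is a placeholder; extend the checks to fit your pipeline):

# Batch-check .srt files: valid UTF-8? BOM present? which line endings?
from pathlib import Path

folder = Path("subtitles")  # placeholder folder

for path in sorted(folder.glob("*.srt")):
    data = path.read_bytes()
    has_bom = data.startswith(b"\xef\xbb\xbf")
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    line_endings = "CRLF" if b"\r\n" in data else "LF"
    print(f"{path.name}: UTF-8={valid_utf8}  BOM={has_bom}  line endings={line_endings}")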

What's Next?

Now that you understand encoding, you're ready to tackle more advanced subtitle formats. In our next article, we'll explore SSA/ASS subtitles, where proper encoding is crucial for advanced styling and positioning.