Feature Request: Automatic Subtitle Decoding to Human-Readable Text #1770

@alohays

Currently, PyAV exposes decoded subtitles as SubtitleSet objects. While this low-level access is very useful, there is no built-in high-level API that turns these subtitles into a human-readable text format (e.g., SRT or ASS). Such an API would greatly simplify workflows for users who need to extract and process subtitle text from container files.

Use Case:

  • I work with container formats like MKV that include embedded subtitles.
  • I need to extract these subtitle streams and convert them into a standard text format (such as SRT) for further processing (e.g., for transcription, translation, or overlaying onto videos).
  • Currently, I must manually parse the raw subtitle data from the SubtitleSet objects, which is error-prone and cumbersome.
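To illustrate the kind of manual parsing described above, here is a minimal sketch for ASS-style event lines. It does not call PyAV at all. The helper name `ass_event_to_text` is hypothetical, and the two field layouts it handles (a full `Dialogue:` line, and a shorter event line that starts with a read-order field) are assumptions about how the raw event text may arrive, so they should be checked against the actual data.

```python
import re

# ASS override blocks such as {\i1} or {\pos(10,20)} carry styling, not text.
_OVERRIDE_TAGS = re.compile(r"\{[^}]*\}")


def ass_event_to_text(event: str) -> str:
    """Return the plain text of one ASS event line (hypothetical helper)."""
    event = event.strip()
    if event.startswith("Dialogue:"):
        # Assumed layout:
        # Layer,Start,End,Style,Name,MarginL,MarginR,MarginV,Effect,Text
        fields = event[len("Dialogue:"):].split(",", 9)
    else:
        # Assumed layout:
        # ReadOrder,Layer,Style,Name,MarginL,MarginR,MarginV,Effect,Text
        fields = event.split(",", 8)
    # Text is the last field; maxsplit keeps commas inside the text intact.
    text = _OVERRIDE_TAGS.sub("", fields[-1])
    # \N and \n are ASS line breaks, \h is a hard space.
    text = text.replace("\\N", "\n").replace("\\n", "\n").replace("\\h", " ")
    return text.strip()
```

Every caller currently has to repeat this splitting and tag stripping, and it is easy to get the field count wrong, which is the error-prone part a built-in API could remove.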

Rationale:

  • Ease of Use: Automating the conversion of raw subtitle data into a readable format would help reduce boilerplate code and simplify many common subtitle processing tasks.
  • Wider Adoption: Many users coming from multimedia processing backgrounds expect a higher-level API for subtitle handling, similar to what FFmpeg’s CLI offers (for example, `ffmpeg -i input.mkv -map 0:s:0 subs.srt` extracts the first subtitle stream as SRT).
  • Incremental Implementation: Even if full support for all subtitle formats isn’t feasible immediately, a partial implementation that covers the most common text-based formats (like SRT and ASS) would be very beneficial.

Potential Implementation Ideas:

  • Introduce a method (e.g., SubtitleSet.decode_text()) that processes the raw subtitle packets and returns the subtitle text.
  • Allow the method to either return the text as a string or write it directly to a file.
  • Optionally support parameters that let users choose the output format, handling details like timing, formatting, and styling where applicable.

Questions & Discussion:

  • Is automatic subtitle decoding considered within the intended scope of PyAV?
  • Are there known technical challenges or design philosophies that would advise against adding such a feature?
